Comments (31)

FlorianRappl commented on July 30, 2024

I will investigate these two.

On the first one it might be a little bit problematic - the Windows Store HTTP requester might be odd. It usually works, therefore I assume it is a problem in the particular .NET subset available in the WinRT implementation. I cannot promise to come up with a good solution, but I will definitely try to find out what the source of the problem is. Maybe there is an elegant way around it.

The second one seems odd in the example, especially since you are already giving AngleSharp a string. AngleSharp does not do anything regarding encoding when confronted with a string. Have you tried passing in the raw bytes in the form of a stream (e.g. MemoryStream)? I am also curious whether the bug may already be fixed in the current version (from the repo - not available on GitHub). These two pieces of information (raw bytes + latest source) would be very helpful for finding a solution.

Thanks!

from anglesharp.

silverbird commented on July 30, 2024

Thanks for the quick reply! Just as you mentioned, there really is a way around the two 'problems': removing the encoding parameter. The following code works:

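    static async void Test3()
    {
        var http = new HttpClient();
        // Fetch the page and read the body as raw bytes, without decoding it ourselves.
        var response = await http.GetAsync(new Uri("http://item.jd.com/11312278.html"));
        var buffer = await response.Content.ReadAsBufferAsync();
        // Hand AngleSharp the byte stream so it can determine the encoding itself.
        var document = DocumentBuilder.Html(buffer.AsStream());
        Debug.WriteLine(document.ToHtml());
    }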

It seems that AngleSharp doesn't need the encoding to be specified when downloading or parsing web pages. That's unexpected, but amazingly good: it means I no longer need to care about the scary encoding, which will surely make the code simpler and more elegant.

Great job! Thanks!

FlorianRappl commented on July 30, 2024

AngleSharp follows the W3C spec in that regard. It uses a default encoding (depending on the localization of the OS where AngleSharp is running), which can be overridden either explicitly or by BOM detection (which happens for every stream). However, some, if not most, streams do not start with a BOM, which does not matter, since usually every webpage has a meta tag with the correct information. If AngleSharp is not 100% certain about the encoding in use (no BOM detected, nothing explicitly given), then the encoding will be switched when encountering such a meta tag.

Therefore I am glad that this works as I thought. Strings will not get re-interpreted, as this has been proven buggy and usually unnecessary (after all it is already a .NET string, which should have come out after some appropriate Encoding.GetString() invocation). Unfortunately in the scenario you described that becomes messy (having a string first and relying on AngleSharp to reinterpret it correctly).

Alright, so one problem down. Now I need to do my homework and fix the other one, regarding the requester.
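
As a rough illustration of the BOM step described above, here is a minimal C# sketch. It is not AngleSharp's code: the names `BomSniffer`, `DetectBom` and `ChooseInitial` are hypothetical, and the sketch only checks the UTF-8 and UTF-16 byte order marks. When no BOM is found it returns the caller's fallback and reports that the choice is not certain, which is the situation in which a later meta tag may still switch the encoding.

```csharp
using System;
using System.Text;

public static class BomSniffer
{
    // Returns the encoding indicated by a leading byte order mark, or null if there is none.
    public static Encoding DetectBom(byte[] head)
    {
        if (head.Length >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
            return Encoding.UTF8;             // UTF-8 BOM
        if (head.Length >= 2 && head[0] == 0xFE && head[1] == 0xFF)
            return Encoding.BigEndianUnicode; // UTF-16 BE BOM
        if (head.Length >= 2 && head[0] == 0xFF && head[1] == 0xFE)
            return Encoding.Unicode;          // UTF-16 LE BOM
        return null;
    }

    // Hypothetical helper: picks the encoding to start decoding with.
    // 'certain' is true only when a BOM was found; otherwise the fallback
    // is tentative and a meta tag found later may replace it.
    public static Encoding ChooseInitial(byte[] head, Encoding fallback, out bool certain)
    {
        Encoding fromBom = DetectBom(head);
        certain = fromBom != null;
        return fromBom ?? fallback;
    }

    static void Main()
    {
        byte[] withBom = { 0xEF, 0xBB, 0xBF, (byte)'<', (byte)'p', (byte)'>' };
        byte[] withoutBom = { (byte)'<', (byte)'p', (byte)'>' };

        bool certain;
        Encoding chosen = ChooseInitial(withBom, Encoding.Unicode, out certain);
        Console.WriteLine(chosen.WebName + ", certain: " + certain);

        chosen = ChooseInitial(withoutBom, Encoding.Unicode, out certain);
        Console.WriteLine(chosen.WebName + ", certain: " + certain);
    }
}
```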

silverbird commented on July 30, 2024

Thanks, Florian, this detailed explanation helps a lot. I hope fixing the other one goes smoothly. Best regards!

silverbird commented on July 30, 2024

There are still some encoding problems. I've tested the 'DocumentBuilder.Html(string pageSource)' method with more web pages, and most of them are OK. But with some page sources (charset='gb2312'), it outputs unexpected characters (such as '??') in some places (not everywhere). That's very odd, since most of the other Chinese characters are output correctly.

FlorianRappl commented on July 30, 2024

Alright so I think I have been blind regarding the first issue. You already posted a very important hint: "System.Net.HttpWebRequest.set_UserAgent(System.String)".

This is special in the Windows Phone / Windows Store implementation of the HttpWebRequest. I don't know why this worked in the previous version, maybe I did not set the user-agent there. Version v0.8 will have fixed this (v0.8 will be released within December for sure, however, I can't say if this will be before or after Christmas).

Sorry for the troubles and thanks for reporting the issue!

FlorianRappl commented on July 30, 2024

Are you testing the string version again or the stream version? If it is the former, then I would say that this is due to UTF-16 and maybe expected (maybe it indicates a bug and should be changed - I would need the sources of these webpages to have a look). If it is the latter, then this should definitely be fixed.
Can you list some of the pages you are using? That would help me reproduce the bugs and construct some test cases. Thanks!

silverbird commented on July 30, 2024

Sorry, my mistake. I meant the stream version. Here is the code:

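    static async void Test3()
    {
        var http = new HttpClient();
        // Fetch the page and read the body as raw bytes, without decoding it ourselves.
        var response = await http.GetAsync(new Uri("http://trade.500.com/bjdcsf/"));
        var buffer = await response.Content.ReadAsBufferAsync();
        // Hand AngleSharp the byte stream so it can determine the encoding itself.
        var document = DocumentBuilder.Html(buffer.AsStream());
        Debug.WriteLine(document.ToHtml());
    }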

Search the output and you will find a line containing text like this: 'lg="足球-?仿薨? rq=',
which should be 'lg="足球-欧罗巴" rq='. That's pretty odd. It seems ‘欧罗巴"’ was unexpectedly reinterpreted as ‘?仿薨?’. Thanks!

FlorianRappl commented on July 30, 2024

I wrote two tests (well, the first one seems to work anyway now - so I just test that the title is correct). The second test reflects the last case you described. I had a look at the page and found some of the items which should have invalid attributes. However, for me the tests pass.

The question now is: Is it due to the initial encoding of my system or due to Debug or Console output? If Debug or Console output just displays (!) the character wrong, then it's not much of an issue (I can't rewrite .NET's Console implementation or the Windows shell)... If it's due to the initial encoding, then I need to do something.

Can you run these tests and look if they succeed? That would be of great help. Thanks!

silverbird commented on July 30, 2024

I can't run the tests now, but I can tell that it's not a VS Debug or Console issue. Initially I wanted to use DocumentBuilder.Html(Stream) in my project to get an IHtmlDocument, and then retrieve the data by iterating through the rows of a table. Each row contains a piece of JSON text, and the unexpected characters appear in it. They mess up the whole piece of JSON data, so when I pass it to my JSON parser (Newtonsoft.Json), it fails and crashes there every time. Thanks!

FlorianRappl commented on July 30, 2024

Well, I don't see how a text character can break JSON parsing (it is not a control character, therefore I don't see a big issue unless the JSON parser explicitly checks for unicode range violations), but that is not the issue.

As I wrote, the test works and there does not seem to be an issue. Therefore (if we can rule out the Console output) I'd like to know your initial encoding. The standard ones (UTF-8 and Windows-1252) work. You can have a look at the table in point 9) of http://www.w3.org/TR/html5/syntax.html#determining-the-character-encoding.

silverbird commented on July 30, 2024

If the wrongly decoded characters appear only in the value part of a key-value pair, that is sometimes acceptable, but not when they 'eat' the double quote which encloses the value - this breaks the structure of the JSON data.

The encoding of the web page is 'GB2312', which is not in the table listed on the W3C website. However, this encoding was not a problem in older versions of AngleSharp (ver 0.50 or earlier) - I could first decode the stream as 'gb2312', then pass the decoded string to DocumentBuilder.Html(string), and it worked.

FlorianRappl commented on July 30, 2024

I am not talking about the encoding of the webpage but the encoding according to your computer. As I wrote, there is an initial encoding and I am trying to figure out if changing the encoding might be the problem.

AngleSharp v0.5 used a different algorithm which a.) did not allow modifying the stream and b.) was not according to the W3C specification. Therefore there is no way of switching back.

Again: What is YOUR encoding (not the encoding of the webpage) according to the table given in the provided link?

silverbird commented on July 30, 2024

My system default encoding is gb2312. The output of this line

Debug.WriteLine(System.Text.Encoding.Default.WebName);

is 'gb2312'. Thanks!

FlorianRappl commented on July 30, 2024

Hm, that's not what I meant. That is the encoding chosen by .NET - but I am asking for the initial encoding according to the W3C specification (therefore the table). You can also get the suggestion from AngleSharp by using DocumentEncoding.Suggest(CultureInfo.CurrentCulture.Name).

Please note that the class is internal, so you will either need reflection or the sources to trigger it. This is the reason I just gave you the link to the table.
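
To make the idea of a locale-based suggestion concrete, here is a small C# sketch of such a lookup. It is only an assumption about the general shape, not the internal DocumentEncoding class: the name `DefaultEncodingTable` is hypothetical, and the table is deliberately partial, containing just the zh-CN row (gb18030) plus the windows-1252 fallback that the W3C table uses for all other locales.

```csharp
using System;
using System.Collections.Generic;

public static class DefaultEncodingTable
{
    // Hypothetical, deliberately partial table keyed by culture name.
    // Only one row of the W3C table is reproduced here.
    static readonly Dictionary<string, string> ByLocale =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "zh-CN", "gb18030" },
        };

    // Returns the suggested default encoding name for a culture name,
    // falling back to windows-1252 when no row matches.
    public static string Suggest(string cultureName)
    {
        string webName;
        if (cultureName != null && ByLocale.TryGetValue(cultureName, out webName))
            return webName;
        return "windows-1252";
    }

    static void Main()
    {
        Console.WriteLine(Suggest("zh-CN")); // gb18030
        Console.WriteLine(Suggest("en-US")); // windows-1252
    }
}
```

In this sketch, any culture name that does not match a row exactly falls through to windows-1252.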

silverbird commented on July 30, 2024

Sorry, my system is Windows 8.1 and I don't know how to detect its default encoding. I've searched the Control Panel, but can't find where to check or change the default encoding.

FlorianRappl commented on July 30, 2024

It's not in the system, it's in AngleSharp. Just look at the table provided in the link:
http://www.w3.org/TR/html5/syntax.html#determining-the-character-encoding
(scroll down to point 9 and see what is the "Suggested default encoding" for your local language)

silverbird commented on July 30, 2024

It's GB18030.

FlorianRappl commented on July 30, 2024

Alright, that's already tested. There does not seem to be any problem there, unless I am missing something.

Therefore, there are two things I am still curious about:

  1. Have you tried the latest AngleSharp version (not v0.7, but the latest from the sources)?
  2. Have you tried running the unit tests? (I know you couldn't before, but maybe you now have the chance...)

It could be that the bug is really dependent on the original encoding, however, that one has already been fixed in the sources (will be published with v0.8).

silverbird commented on July 30, 2024

Thanks. I will try the latest version.

silverbird commented on July 30, 2024

I downloaded the source code as a Zip file and ran a test with the same code. It seems there are still some encoding problems. Just search for this line in the output:

'class="countdown_time" title="剩余?奔? style='

which should be

'class="countdown_time" title="剩余时间" style='

That means AngleSharp incorrectly reinterpreted ' 时间" ' as ' ?奔? ', which is not acceptable.

And another wrongly decoded line:

fid="443861" homesxname="?寺?日"

which should be

fid="443861" homesxname="克卢日"

That's pretty odd. I guess the problem may be the gb2312 encoding. Maybe AngleSharp follows the W3C suggestion, which uses gb18030 as the default encoding for Simplified Chinese, so when we pass it gb2312-encoded Chinese characters, it sometimes can't decode them correctly.

FlorianRappl commented on July 30, 2024

The tests can't reproduce your issues. The expected characters are being found. I don't know how you produce those bugs, but the unit tests say that the expected values match the actual ones.

Also please don't post lines, but post reproducible (!) code in form of CSS selectors. The page is ~1.5 MB, which is more text than I would like to see. A unique selector is desired to provide a solid unit test.

silverbird commented on July 30, 2024

Well, it seems the problem lies in the Windows OS default settings. I ran these two lines on my computer in a Windows Store app project:

       var suggestedEncoding =  DocumentEncoding.Suggest(CultureInfo.CurrentCulture.Name);
       Debug.WriteLine(suggestedEncoding.WebName);

and the output is 'Windows-1252'.

FlorianRappl commented on July 30, 2024

Coming from Windows-1252 is already covered. At the moment I find it hard to believe that there is an actual bug. If you could write a unit test that fails (and shouldn't fail) using the data in Assets.trade_500, then I am convinced.

silverbird commented on July 30, 2024

I don't know how to write a unit test, but I can offer a code snippet that does not run through on my computer (Windows Store app project):

    static async void Test()
    {
        var http = new HttpClient();
        var request = await http.GetAsync(new Uri("http://trade.500.com/bjdcsf/"));
        var response = await request.Content.ReadAsBufferAsync();
        var document = DocumentBuilder.Html(response.AsStream());

        var bet_content = document.GetElementById("bet_content");
        var el = bet_content.GetElementsByTagName("table").Where(x => x.ClassName.Contains("bet_table"));

        foreach (var e in el)
        {
            IHtmlTableElement t = (IHtmlTableElement)e;

            foreach (var tr in t.Rows)
            {
                string awaysxname = tr.GetAttribute("awaysxname");
                string homesxname = tr.GetAttribute("homesxname");

                Debug.WriteLine(homesxname + " vs " + awaysxname);
                if (string.IsNullOrEmpty(homesxname))
                {
                    throw new Exception("Error: Home team name should exist while it's not! " + tr.ToHtml());
                }

                if (homesxname.Contains("?"))
                {
                    throw new Exception("invalid home team name: " + homesxname);
                }

                if (string.IsNullOrEmpty(awaysxname))
                {
                    throw new Exception("Error: Away team name should exist while it's not! " + tr.ToHtml());
                }

                if (awaysxname.Contains("?"))
                {
                    throw new Exception("invalid away team name: " + awaysxname);
                }
            }
        }
    }

Could you test this code for me? If none of the four exceptions is triggered, then the problem is probably with my OS or VS settings. Thanks!

FlorianRappl commented on July 30, 2024

Alright I found some errors. Right now I suspect they are due to boundaries (surrogate / multi-byte). I'll try to find a nice way around this issue. A fix will be released with v0.8.

Thanks for your time and patience!

silverbird commented on July 30, 2024

Great job! Glad to see the bug has been found. Looking forward to the new version. Thanks!

FlorianRappl commented on July 30, 2024

I am still not sure what happened, but my guess is that the decoder keeps the internal state, while directly using encoding throws it away. Anyway, first tests seem to be positive. I will include more unit tests for this.
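
The difference between a stateful decoder and stateless per-chunk decoding can be shown with a short C# sketch. This is not AngleSharp's code, and it makes two assumptions for the sake of a self-contained example: the class and method names are hypothetical, and UTF-8 stands in for gb2312 (both are multi-byte encodings, so a character can straddle a chunk boundary in either). Decoding each chunk with Encoding.GetString loses a character that is split across two chunks, while a single Decoder instance buffers the trailing partial bytes until the next call.

```csharp
using System;
using System.Text;

public static class ChunkedDecodingDemo
{
    // Decodes every chunk independently. A character whose bytes are split
    // across two chunks cannot be recovered and becomes replacement characters.
    public static string DecodeStateless(byte[] bytes, int chunkSize)
    {
        var sb = new StringBuilder();
        for (int offset = 0; offset < bytes.Length; offset += chunkSize)
        {
            int count = Math.Min(chunkSize, bytes.Length - offset);
            sb.Append(Encoding.UTF8.GetString(bytes, offset, count));
        }
        return sb.ToString();
    }

    // Uses one Decoder for all chunks. The decoder keeps incomplete trailing
    // bytes as internal state and completes the character on the next call.
    public static string DecodeStateful(byte[] bytes, int chunkSize)
    {
        Decoder decoder = Encoding.UTF8.GetDecoder();
        var sb = new StringBuilder();
        var chars = new char[Encoding.UTF8.GetMaxCharCount(chunkSize)];
        for (int offset = 0; offset < bytes.Length; offset += chunkSize)
        {
            int count = Math.Min(chunkSize, bytes.Length - offset);
            int written = decoder.GetChars(bytes, offset, count, chars, 0);
            sb.Append(chars, 0, written);
        }
        return sb.ToString();
    }

    static void Main()
    {
        // Three characters of three bytes each; a chunk size of 4 splits them.
        byte[] bytes = Encoding.UTF8.GetBytes("欧罗巴");
        Console.WriteLine(DecodeStateless(bytes, 4) == "欧罗巴"); // False
        Console.WriteLine(DecodeStateful(bytes, 4) == "欧罗巴");  // True
    }
}
```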

silverbird commented on July 30, 2024

I've downloaded the latest source code and it passed the test successfully. It seems the second bug has already been fixed.
NOTE: The URL in the test code should be changed to http://trade.500.com/bjdcsf/?expect=41202, because the content of the original URL changes every two or three days.

FlorianRappl commented on July 30, 2024

There is no URL used any more - everything is streamed locally. This guarantees reusability.

silverbird commented on July 30, 2024

Seems the first bug has been fixed too! Really good job! Congratulations!
