clj-tagsoup

This is an HTML parser for Clojure, somewhat akin to Common Lisp's cl-html-parse. It is a wrapper around the TagSoup Java SAX parser, but has a DOM interface. It can be built with Leiningen.

Usage

The two main functions defined by clj-tagsoup are parse and parse-string. The first one can take anything accepted by clojure.java.io's reader function except for a Reader, while the second can parse HTML from a string.

The resulting HTML tree is a vector, consisting of:

  1. a keyword representing the tag name,
  2. a map of tag attributes (mapping keywords to strings),
  3. children nodes (strings or vectors of the same format).

This is the same format as used by hiccup, so the output of parse can be passed directly to hiccup.

There are also utility accessors (tag, attributes, children).
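
As an illustration of the node format described above, the sketch below builds a small tree by hand and reads it with three helper functions. The helpers (node-tag, node-attrs, node-children) and the sample tree are hypothetical, written for this example so that it runs without the library; they only mirror what the tag, attributes, and children accessors are described as returning.

```clojure
;; A hand-written node in the format described above:
;; [tag-keyword attribute-map & children]
(def sample
  [:html {}
   [:head {} [:title {} "Example Web Page"]]
   [:body {:class "main"}
    [:p {} "Hello, " [:a {:href "http://example.com"} "world"] "."]]])

;; Hypothetical helpers (not the library's own accessors).
(defn node-tag [node] (first node))          ; the tag keyword
(defn node-attrs [node] (second node))       ; the attribute map
(defn node-children [node] (drop 2 node))    ; everything after the map

(node-tag sample)                            ; tag of the root node
(map node-tag (node-children sample))        ; tags of the root's child nodes
(node-attrs (last (node-children sample)))   ; attribute map of the last child
```

Because children are either strings or vectors of the same shape, any code that walks the tree needs to check which of the two it has before applying these helpers.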

clj-tagsoup will automatically use the correct encoding to parse the file if one is specified in either the HTTP headers (if the argument to parse is a URL object or a string representing one) or a <meta http-equiv="..."> tag.
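
To make the idea of a declared encoding concrete, here is a minimal sketch of pulling a charset name out of raw HTML text. It is not the library's implementation: declared-charset is a hypothetical helper, and it assumes the declaration appears as charset=NAME, as it typically does inside a <meta http-equiv="..."> tag.

```clojure
;; Hypothetical helper: return the charset name declared in the given
;; HTML text, or nil when no charset=NAME declaration is found.
(defn declared-charset [html]
  (second (re-find #"(?i)charset\s*=\s*([\w-]+)" html)))

;; A document that declares its encoding in a meta tag.
(declared-charset
 "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=ISO-8859-2\">")

;; A document with no declaration.
(declared-charset "<p>no declaration here</p>")
```

A parser that finds such a declaration after it has already started reading has to reinterpret the input stream in the indicated encoding.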

clj-tagsoup is meant to parse HTML tag soup, but, in practice, nothing prevents you from using it to parse arbitrary (potentially malformed) XML. The :xml keyword argument causes clj-tagsoup to take into consideration the XML header when detecting the encoding.

There are two other options for parsing XML:

  • parse-xml just invokes clojure.xml/parse with TagSoup, so the output format is compatible with clojure.xml and is not the one described above.
  • lazy-parse-xml (introduced in clj-tagsoup 0.3.0) returns a lazy sequence of Event records defined by clojure.data.xml, similarly to the source-seq function from that library.

Example

project.clj:

(defproject clj-tagsoup-example "0.0.1"
  :dependencies [[clj-tagsoup/clj-tagsoup "0.3.0"]])

lein repl:

(use 'pl.danieljanus.tagsoup)
=> nil

(parse "http://example.com")
=> [:html {}
          [:head {}
                 [:title {} "Example Web Page"]]
          [:body {}
                 [:p {} "You have reached this web page by typing \"example.com\",\n\"example.net\",\n  or \"example.org\" into your web browser."]
                 [:p {} "These domain names are reserved for use in documentation and are not available \n  for registration. See "
                     [:a {:shape "rect", :href "http://www.rfc-editor.org/rfc/rfc2606.txt"} "RFC \n  2606"]
                     ", Section 3."]]]

FAQ

  • Why not just use Enlive?

    Truth be told, I wrote clj-tagsoup prior to discovering Enlive, which is an excellent library. That said, I believe clj-tagsoup has its niche. Here is an à la carte list of differences between the two:

    • Enlive is a full-blown templating library; clj-tagsoup just parses HTML (and XML).
    • Unlike Enlive, clj-tagsoup's parse function goes out of its way to return parsed data in a proper encoding. It will detect the <meta http-equiv="..."> tag in your data and reinterpret the input stream to the indicated encoding as needed.
    • clj-tagsoup boasts a way to lazily parse XML with TagSoup.
  • What's with the dependency on stax-utils?

    It's for lazy-parse-xml. It's needed because that function uses clojure.data.xml, which under the hood uses the StAX API. TagSoup is a SAX parser, so a bridge between the two parsing APIs is needed.

    If you don't use lazy-parse-xml, you can optionally exclude stax-utils from your project.clj, like this:

     :dependencies [[clj-tagsoup "0.3.0" :exclusions [net.java.dev.stax-utils/stax-utils]]]
    

Author

clj-tagsoup was written by Daniel Janus.

clj-tagsoup's People

Contributors

curious-attempt-bunny, jwr, madstap, millettjon, nathell, noisesmith, pkaleta, rossabaker, tebeka, timowest

clj-tagsoup's Issues

would like a parse-xml-string function

I tried creating one as below but the startparse function is private:

(defn parse-xml-string
  [s]
  (xml/parse (-> s .getBytes ByteArrayInputStream.) tagsoup/startparse-tagsoup))

Any ideas welcome.

Incorrect nesting

In the wild, I have noticed that the results of parse-string don't always nest as expected. For example, if you pull in the DOM for https://www.google.com/search?q=dentist+pinellas+county+fl (I'm using clj-http to make the request), there is a div with id ires (the search results) that contains one child ol which itself contains 10 or so children with class g (each individual search result). The first of these children also contains class _Arj. However, the tagsoup result shows the ol and the _Arj g div as siblings, and the remaining g-classed divs as direct children of the body tag. I'm not sure if this is an issue with clj-tagsoup or something upstream, but I thought I'd bring it to your attention.

Add usage examples

Once the HTML is parsed, how can I most efficiently query the parsed document? That is, I would want to be able to drill down as if it were a map:

(get-in x [:html :head :title])

It would be great if you added some recommendations on how to do that transformation (for example, https://github.com/cjohansen/hiccup-find looks promising).
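
One possible approach, sketched here without any extra library, is a small function that follows a path of tag keywords through the [tag attrs & children] vectors. The name find-path and the sample page are hypothetical, and the sketch only follows the first matching child at each step, so it is a starting point rather than a full query mechanism.

```clojure
;; Hypothetical sample in the format returned by the parser.
(def page
  [:html {}
   [:head {} [:title {} "Example Web Page"]]
   [:body {} [:p {} "Some text."]]])

;; Hypothetical helper: the first keyword in the path must match the tag
;; of the node itself; the remaining keywords are matched against child
;; nodes, taking the first child that matches at each level.
(defn find-path [node [t & more]]
  (cond
    (not (vector? node)) nil          ; string children never match
    (not= (first node) t) nil         ; wrong tag at this level
    (empty? more) node                ; path exhausted: this is the node
    :else (some #(find-path % more) (drop 2 node))))

(find-path page [:html :head :title]) ; the title node, if present
(find-path page [:html :body :div])   ; nil, since there is no such child
```

The children of the returned node (everything after the attribute map) hold the text, so the title string above would be obtained with (drop 2 (find-path page [:html :head :title])).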

Doesn't compile on Clojure 1.9.0-alpha12

The new alpha releases of Clojure perform compile-time checking of core macros such as ns and defn.

Trying to require clj-tagsoup on Clojure 1.9.0-alpha12 throws the following large error, which shows that the problem actually occurs when it tries to require clojure/data/xml.clj:

clojure.lang.Compiler$CompilerException: 
clojure.lang.ExceptionInfo: Call to clojure.core/defn did not conform to spec:
In: [1 0] val: ({:keys (attrs)} writer) fails spec: :clojure.core.specs/arg-list at: [:args :bs :arity-1 :args] predicate: (cat :args (* :clojure.core.specs/binding-form) :varargs (? (cat :amp #{(quote &)} :form :clojure.core.specs/binding-form))),  Extra input
In: [1 0] val: {:keys (attrs)} fails spec: :clojure.core.specs/arg-list at: [:args :bs :arity-n :bodies :args] predicate: vector?
:clojure.spec/args  (write-attributes [{:keys (attrs)} writer] (doseq [[k v] attrs] (if (namespace k) (.writeAttribute writer (str (namespace k)) (name k) (str v)) (.writeAttribute writer (name k) (str v))))) 
...
compiling:(clojure/data/xml.clj:30:1)

The version of clojure/data/xml that clj-tagsoup is using contains the broken code and is also quite old. Bumping the clojure/data/xml dependency should fix the problem.
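
Judging from the message above, the rejected form appears to be the {:keys (attrs)} destructuring in write-attributes, where the value of :keys is a list rather than a vector. The sketch below shows the accepted spelling on a small hypothetical function (attrs-of is not part of either library).

```clojure
;; Form rejected by the defn spec: the value of :keys is a list.
;;   (defn attrs-of [{:keys (attrs)} writer] attrs)

;; Accepted form: the value of :keys is a vector.
(defn attrs-of [{:keys [attrs]} writer]
  attrs)

;; The second argument is unused here; it only mirrors the shape of the
;; argument list quoted in the error message.
(attrs-of {:attrs {:id "x"}} nil)
```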

Fails with Clojure 1.10.1

If I update my Clojure dependency from 1.6.0 to 1.10.1, clj-tagsoup no longer works.

From the repl, I get the following when trying to :require tagsoup:

CompilerException java.lang.Exception: namespace 'clojure.data.xml' not found, compiling:(pl/danieljanus/tagsoup.clj:1:1)

From my test cases, I get:

In: [1 0] val: {:keys (attrs)} fails spec: :clojure.core.specs.alpha/arg-list at: [:args :bs :arity-n :bodies :args] predicate: vector?
In: [1 0] val: ({:keys (attrs)} writer) fails spec: :clojure.core.specs.alpha/arg-list at: [:args :bs :arity-1 :args] predicate: (cat :args (* :clojure.core.specs.alpha/binding-form) :varargs (? (cat :amp #{(quote &)} :form :clojure.core.specs.alpha/binding-form))),  Extra input
 #:clojure.spec.alpha{:problems ({:path [:args :bs :arity-1 :args], :reason "Extra input", :pred (clojure.spec.alpha/cat :args (clojure.spec.alpha/* :clojure.core.specs.alpha/binding-form) :varargs (clojure.spec.alpha/? (clojure.spec.alpha/cat :amp #{(quote &)} :form :clojure.core.specs.alpha/binding-form))), :val ({:keys (attrs)} writer), :via [:clojure.core.specs.alpha/defn-args :clojure.core.specs.alpha/args+body :clojure.core.specs.alpha/arg-list :clojure.core.specs.alpha/arg-list], :in [1 0]} {:path [:args :bs :arity-n :bodies :args], :pred clojure.core/vector?, :val {:keys (attrs)}, :via [:clojure.core.specs.alpha/defn-args :clojure.core.specs.alpha/args+body :clojure.core.specs.alpha/args+body :clojure.core.specs.alpha/args+body :clojure.core.specs.alpha/arg-list :clojure.core.specs.alpha/arg-list], :in [1 0]}), :spec #object[clojure.spec.alpha$regex_spec_impl$reify__2436 0x79a1728c "clojure.spec.alpha$regex_spec_impl$reify__2436@79a1728c"], :value (write-attributes [{:keys (attrs)} writer] (doseq [[k v] attrs] (if (namespace k) (.writeAttribute writer (str (namespace k)) (name k) (str v)) (.writeAttribute writer (name k) (str v))))), :args (write-attributes [{:keys (attrs)} writer] (doseq [[k v] attrs] (if (namespace k) (.writeAttribute writer (str (namespace k)) (name k) (str v)) (.writeAttribute writer (name k) (str v)))))}, compiling:(clojure/data/xml.clj:30:1)

Type errors after type hints were added

In working with the newer version I've discovered that versions after 0.3.0 / 2bea304 consistently cause this error on my machine:

pl.danieljanus.tagsoup=> (lazy-parse-xml "http://google.com")

ClassCastException com.sun.xml.internal.stream.events.AttributeImpl cannot be cast to javax.xml.stream.events.StartElement  pl.danieljanus.tagsoup/xml-name (tagsoup.clj:182)

Leiningen 2.5.3 on Java 1.8.0_65 Java HotSpot(TM) 64-Bit Server VM

Need to coexist with clojure.data.xml version 0.2.0-alpha6

Applications that need namespacing with XML require a newer clojure.data.xml, such as version 0.2.0-alpha6. But there were breaking changes between versions 0.0.3 and 0.1.0-alpha in which clojure.data.xml/event was removed. As a result, it's not possible to load clojure.data.xml version 0.2.0-alpha6 and tagsoup 0.0.3 at the same time.

It's not obvious to me how to transform the 0.0.3-dependent eventize code to work with newer versions of clojure.data.xml. Would it be possible/practical to simply drop the eventizing code for now to upgrade to a newer clojure.data.xml?

Attributes get lowercased; xmlns omitted

I tried parsing some Android XML configuration files; these have attributes, for instance, like

android:keyWidth="15%p"
android:horizontalGap="0px"
android:verticalGap="0px"

clj-tagsoup lowercases the attribute names, e.g.:

android:keywidth

so the result doesn't round-trip through Hiccup.

Additionally, it doesn't pick up the xmlns declaration in the root element, e.g.:

xmlns:android="http://schemas.android.com/apk/res/android"

This has to be put back in by hand.

(I'm using Android as an example, but presumably these issues have relevance beyond Android.)

Thanks,

Nick.
