PrettySwiftSoup
is a convention-compliant version of SwiftSoup
.
SwiftSoup
is a Swift library used for parsing and manipulating HTML documents. SwiftSoup
allows developers to extract data from HTML, manipulate the Document Object Model (DOM), and handle character encodings, making it useful for tasks like web scraping and HTML parsing in Swift applications.
By inheriting SwiftSoup
, PrettySwiftSoup
improves the object naming conventions to align with Swift standards.
PrettySwiftSoup
provides renamed objects, appropriate argument labels for methods, and built-in documentation to ensure continuity with the familiar Swift development style.
Warning
Object symbols may continue to change until version 1.0.0
is released. Please pay attention to dependency version management.
PrettySwiftSoup
is available through Swift Package Manager.
To install it, simply add the dependency to your Package.Swift file:
...
dependencies: [
.package(url: "https://github.com/GST/PrettySwiftSoup.git", from: "0.1.0"),
],
targets: [
.target( name: "YourTarget", dependencies: ["PrettySwiftSoup"]),
]
...
PrettySwiftSoup |
SwiftSoup |
---|---|
0.1.0 |
2.7.2 |
let html =
"""
<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>
"""
guard let document: HTMLDocument = HTMLParser.parse(html) else {
fatalError("Failed to parse HTML document")
}
let headingElements: HTMLElements = document.getElementsByTag(named: "h1")
let element: HTMLElement? = headingElements.first
let textInElement: String = element!.text
print(textInElement)
// Prints "My First Heading"
HTMLDocument
is a subclass ofHTMLElement
which includes entire elements tree in an HTML document. Use theHTMLParser.parse(_:)
method to get anHTMLDocument
HTMLDocument
is the root node of the elements tree. To find elements with a specific tag name in an HTML tree, useHTMLElement
's methodgetElementsByTag(named:)
. You can also use these methods to get specific elements:getElementByID(_:)
getElementsByAttribute(named:)
getElementsByClass(named:)
- and more methods like these...
- You retrieve a list of elements of type
HTMLElements
.HTMLElements
is a reference type that containsHTMLElement
objects. You can handle this type like a built-inArray
in Swift. In this example, I usedfirst
optional property to get the first element. HTMLElement
contains data of an HTML element, such as text, attribute, clas snames, etc. I tried to print the element's text, and it is printed successfully.
Let's assume html
is a String
variable with this HTML code:
<div id="main" class="container" role="main-container" data-version="0.1.0">
<header class="header main-header">
<h1>Welcome to the Example Page</h1>
</header>
<nav class="navbar" id="navBar" role="navigation">
<ul class="nav-list">
<li class="nav-item"><a href="/section1">Section 1</a></li>
<li class="nav-item"><a href="/section2">Section 2</a></li>
</ul>
</nav>
<footer class="footer">
<p>Footer information</p>
</footer>
<script>console.log('Hello, world!');</script>
</div>
HTMLElement
contains information about an HTML element as a node in the the tree:
guard let document = HTMLParser.parse(html, baseURI: "http://example.com/") else {
fatalError("Failed to parse")
}
if let mainElement = document.getElementById("main") {
let className: String? = mainElement.className // Optional("container")
let classNames: OrderedSet<String> = mainElement.classNames // ["container"]
let tagName: String = mainElement.tagName // "div"
let normalizedTagName: String = mainElement.tagNameNormal // "div"
let role: String? = mainElement.getAttribute(withKey: "role") // Optional("main-container")
let style: String? = mainElement.getAttribute(withKey: "style") // nil
let nonTextContent: String? = mainElement.nonTextContent // Optional("console.log('Hello, world!');")
let datas: [String:String] = mainElement.datas // ["version":"0.1.0"]
let selector: String = mainElement.cssSelector // "#container"
// Get the first child (the <header> element)
guard let firstChild: HTMLElement = mainElement.firstChild else {
return
}
let className2 = firstChild.className // Optional("header main-header")
let classNames2 = firstChild.classNames // ["header", "main-header"]
let text: String = firstChild.text // "Welcome to the Example Page"
let ownText: String = firstChild.ownText // ""
// Get the second child (the <nav> element)
guard let secondChild: HTMLElement = mainElement.getChild(at: 1) else {
return
}
let id: String? = secondChild.id // Optional("navBar")
// Get the third child (the <footer> element) which is the next sibling
guard let thirdChild: HTMLElement = secondChild.nextSiblingElement else {
return
}
let tagName2: String = thirdChild.tagName // "footer"
let elementWithHref = mainElement
.getElementById("navBar")?
.firstChild?
.getElementsByTag(named: "a")
.first
if let elementWithHref {
let urlPath: String? = elementWithHref.absoluteURLPath(ofAttribute: "href") // Optional("http://example.com/section1")
}
}
- Specify the
baseURI:
argument inHTMLParser.parse(_:baseURI:)
to set the base URI of all elements in the document. - Get elements by calling
getElementByID(:)
,getElementsByTag(named:)
,getElementsByClass(named:)
, etc. These methods search all descendants of the element including itself. You can also useselect(cssQuery:)
to find elements using a CSS selector. - Use
HTMLElement
's properties and methods to get corresponding properties of an HTML element, such asclassName
,tagName
,getAttribute(withKey:)
, etc.- The
nonTextContent
property represents an elements non-textual contents, like<script>
,<style>
, etc. This includes all non-textual content of the element's descendants. - The
text
property includes not only the element's own text but also the text of its descendants.
- The
- Access relative elements using
firstChild
,getChild(at:)
,nextSiblingElement
,parent
, etc.
guard let document = HTMLParser.parse(html, baseURI: "http://example.com/") else {
fatalError("Failed to parse")
}
let heading = document.getElementsByTag(named: "h1").first!
try! heading.setText("Welcome to Silent Hill")
.setTagName("h2")
.setAttribute(withKey: "foo", value: "bar")
let list = document.getElementById("navBar")!.firstChild!
try! list.appendElement(tagName: "li")
.setClass(names: ["nav-item", "customized"])
.appendElement(tagName: "a")
.setAttribute(withKey: "href", value: "/section3")
.setText("Section 1")
do {
let newElement = HTMLElement(tag: .init("h3"), baseURI: document.baseURI ?? "")
newElement.setText("Hello, world!")
.setClass(names: ["created"])
try heading.insertHTMLAsNextSibling("<br>Bruh")
.insertNodeAsNextSibling(newElement)
} catch let error as SwiftSoupError {
if error == .failedToParseHTML {
print("Invalid HTML")
} else {
print("SwiftSoup error: \(error.localizedDescription)")
}
} catch {
print("Unknown error: \(error.localizedDescription)")
}
let footer = document.getElementsByTag(named: "footer").first!
footer.remove()
- Use setter methods to set properties of an HTML element.
- Almost all setters returns
self
so that you can chain the setter methods. - To add a new element, use
appendElement(tagName:)
,insertHTMLAsNextSibling(_:)
,insertNodeAsNextSibling(_:)
, etc.appendElement(tagName:)
allows you to add new element with a tag name. The new element is inserted as the last child of the caller. You can then set several properties of the element.insertHTMLAsNextSibling(_:)
allows you to add a new element by parsing the given HTML code. The new element is inserted as the following sibling of the caller.insertNodeAsNextSibling(_:)
allows you to add an instance ofHTMLElement
.- There are more methods like these, and they return
self
for method chaining.
- Call
remove()
to remove an element. This removes an element from the HTML tree but does not deinitialize the object or free the memory. - Note that setting and adding methods throw errors. For example, if you try to set an empty tag, it throws
SwiftSoupError.emptyTagName
. Possible errors for each method are described in the built-in documentation.
You want to allow untrusted users to supply HTML for output on your website (e.g. as comment submission). You need to clean this HTML to avoid cross-site scripting (XSS) attacks.
Use the HTML Cleaner
with a configuration specified by a Whitelist
.
do {
let unsafe: String = "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>"
let safe: String = try HTMLParser.cleanBodyFragment(unsafe, whitelist: Whitelist.basic())
// now: <p><a href="http://example.com/" rel="nofollow">Link</a></p>
} catch let error as SwiftSoupError {
print(error.localizedDescription)
} catch {
print("Unknown Error: \(error.localizedDescription)")
}
If you supply a whole HTML document, with a <head>
tag, the cleanBodyFragment(_:baseURI:whitelist:)
method will just return the cleaned body HTML.
You can clean both <head>
and <body>
by providing a Whitelist
for each tags.
do {
let unsafe: String =
"""
<html>
<head>
<title>Hey</title>
<script>console.log('hi');</script>
</head>
<body>
<p>Hello, world!</p>
</body>
</html>
"""
var headWhitelist: Whitelist = {
do {
let customWhitelist = Whitelist.none()
try customWhitelist
.addTags("meta", "style", "title")
return customWhitelist
} catch {
fatalError("Couldn't init head whitelist")
}
}()
guard let unsafeDocument = HTMLParser.parse(unsafe) else { return }
let safe: String? = try Cleaner(headWhitelist: headWhitelist, bodyWhitelist: .relaxed())
.clean(unsafeDocument)
.html
// now: <html><head><title>Hey</title></head><body><p>Hello, world!</p></body></html>
} catch let error as SwiftSoupError {
print(error.localizedDescription)
} catch {
print("Unknown Error: \(error.localizedDescription)")
}
A cross-site scripting attack against your site can really ruin your day, not to mention your users'. Many sites avoid XSS attacks by not allowing HTML in user submitted content: they enforce plain text only, or use an alternative markup syntax like wiki-text or Markdown. These are seldom optimal solutions for the user, as they lower expressiveness, and force the user to learn a new syntax.
A better solution may be to use a rich text WYSIWYG editor (like CKEditor or TinyMCE). These output HTML, and allow the user to work visually. However, their validation is done on the client side: you need to apply a server-side validation to clean up the input and ensure the HTML is safe to place on your site. Otherwise, an attacker can avoid the client-side Javascript validation and inject unsafe HMTL directly into your site
The SwiftSoup whitelist sanitizer works by parsing the input HTML (in a safe, sand-boxed environment), and then iterating through the parse tree and only allowing known-safe tags and attributes (and values) through into the cleaned output.
It does not use regular expressions, which are inappropriate for this task.
SwiftSoup provides a range of Whitelist
configurations to suit most requirements; they can be modified if necessary, but take care.
The cleaner is useful not only for avoiding XSS, but also in limiting the range of elements the user can provide: you may be OK with textual a
, strong
elements, but not structural div
or table
elements.