Presence of <script> in HTML Causing Parser to Miss Selection #67

@defndaines

I have a script (reduced to a minimal reproducible case below) where the "table tr" selector fails to find the first <tr> tag. The example below contains only one <tr> tag, but in other cases the page has 9 <tr> tags and only the last 8 are returned. If I comment out the <script> element in the HTML, the selector returns the correct <tr> results.

(This is from a web scraper I've been playing with to pull information from Goodreads.)

Test script:

local parser = {}

local htmlparser = require("htmlparser")

function parser.book_link(html, title, author)
	local tree = htmlparser.parse(html)
	local books = tree:select("table tr")

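	for _, book in ipairs(books) do
		local book_title = book:select("a.bookTitle")
		local aut = book:select("a.authorName")

		-- Skip rows that lack a title or author link instead of indexing nil.
		if book_title[1] and aut[1]
			and book_title[1].nodes[1]:getcontent():match("^" .. title)
			and aut[1].nodes[1]:getcontent():match(author) then
			-- Strip the query string: "%?" escapes the literal "?", and the outer
			-- parentheses drop gsub's second return value (the match count).
			return "https://www.goodreads.com" .. (book_title[1].attributes["href"]:gsub("%?.*", ""))
		end
	end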

	return nil
end

-- assert turns a failed open into an error that includes the reason.
local file = assert(io.open("gr.html", "r"))
local search_html = file:read("*a")
file:close()

local title = "Waste Tide"
local author = "Chen Qiufan"
local book_link = parser.book_link(search_html, title, author)

-- tostring keeps the failure message readable when book_link is nil.
assert("https://www.goodreads.com/book/show/39863294-waste-tide" == book_link, "Incorrect book link: " .. tostring(book_link))

Test file:

<html><body>
    <script type="text/javascript" charset="utf-8">
      function refreshGroupBox(group_id, book_id) {
        new Ajax.Updater('addGroupBooks' + book_id + '', '/group/add_book_box', {asynchronous:true, evalScripts:true, onSuccess:function(request){refreshGroupBoxComplete(request, book_id);}, parameters:'id=' + group_id + '&book_id=' + book_id + '&refresh=true' + '&authenticity_token=' + encodeURIComponent('g0GG+Rcqg7zUv1eOBiN/m0Gxr1TlkcUeCyRfv9ZM7OGYokz03bxSNPCIOn1o7esOziTbneeb1ztimZdGK0srsg==')})
      }
    </script>
    <table><tr><td>
          <a class="bookTitle" href="/book/show/39863294-waste-tide?from_search=true&amp;from_srp=true&amp;qid=WILgnaZ5jh&amp;rank=1">
            <span itemprop='name'>Waste Tide</span>
          </a>
          <a class="authorName"><span itemprop="name">Chen Qiufan</span></a>,
        </td></tr></table>
  </body></html>
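
Since commenting out the <script> element restores the expected result, one possible stopgap is to strip script blocks from the HTML string before handing it to the parser. The following is only a sketch under that assumption: `strip_scripts` is a hypothetical helper written for this report, not part of htmlparser, and it uses plain Lua patterns, so it would not cope with a literal "</script>" inside a JavaScript string.

```lua
-- Hypothetical helper (not part of htmlparser): remove <script>...</script>
-- blocks from an HTML string before parsing, as a workaround sketch.
local function strip_scripts(html)
	-- Lua patterns have no case-insensitive flag, so each letter is a class.
	local open = "<[sS][cC][rR][iI][pP][tT][^>]*>"
	local close = "</[sS][cC][rR][iI][pP][tT]%s*>"
	-- ".-" is the shortest possible match, and "." also matches newlines in Lua.
	-- Assigning to a single local discards gsub's second result (the count).
	local cleaned = html:gsub(open .. ".-" .. close, "")
	return cleaned
end

-- Small self-contained sample, loosely modelled on the test file above.
local sample = [[<html><body>
<script type="text/javascript">var s = 'a<b';</script>
<table><tr><td>x</td></tr></table>
</body></html>]]

local cleaned = strip_scripts(sample)
print(cleaned)
```

In the test script, the cleaned string could be passed along in place of the raw file contents, for example `parser.book_link(strip_scripts(search_html), title, author)`. This only sidesteps the symptom and does not address whatever the parser does with the script body.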
