REPL Adventures: Web Scraping With The Web Console

2023-10-17

Ever since I began mucking around with Node.js I’ve been deeply enamoured with the REPL. For the first little while I thought it was my first experience with a REPL (turns out bash was, I just didn’t realize it yet). It wasn’t long before I learned how much power a REPL gives a developer (assuming you’re willing to cram a ton of logic into one line). But that’s not the REPL we are going to discuss today. Instead it’s the Web Console, and we’re going to use it to scrape a website.

Now, obviously, if we want to scrape the same website repeatedly, the Web Console is not the proper tool, but it can still be invaluable for getting one’s logic straight.

I wanted to convert an image gallery into a CSV file. The gallery appeared to be hand-rolled (or maybe a script made it, since it was very nice and uniform), but the key factor was that it was all on one page. The HTML looked like this:

<div class="gallery">
  <a target="_blank" href="p/f/20211217_ThreeWiseMen.jpg">
    <img src="p/t/20211217_ThreeWiseMen.jpg" alt="Three Wise Men">
  </a>
  <div class="desc">Dec. 17th. Three Wise Men</div>
</div>

<div class="gallery">
  <a target="_blank" href="p/f/20211217_ThreeMoreWiseMen.jpg">
    <img src="p/t/20211217_ThreeMoreWiseMen.jpg" alt="Three More Wise Men">
  </a>
  <div class="desc">Dec. 17th. Three More Wise Men</div>
</div>

So, the first thing to do is grab the first entry:

> $('.gallery')
<div class="gallery">

We can see a few things here: my input is the $('.gallery') call, and the result of that call is printed on the next line. Not shown here is Firefox’s gear icon, which lets us interact with the result as if it were in the Inspector tab of the Web Developer’s Toolbox. This can be helpful, but we won’t need it today.

> $('.gallery').innerHTML
'
  <a target="_blank" href="p/f/20211217_ThreeWiseMen.jpg">
    <img src="p/t/20211217_ThreeWiseMen.jpg" alt="Three Wise Men">
  </a>
  <div class="desc">Dec. 17th. Three Wise Men</div>
'

Okay, here we get a better view of what we’re actually after. We want four bits of information: The p/f URL, the p/t URL, the alt-text, and the description. I quickly test a theory:

> $('div', $('.gallery')).innerHTML
'Dec. 17th. Three Wise Men'

Great! You can see here that $() can take a second argument: a node to search within instead of document. The three other bits of information can be easily gathered:

> n = $('.gallery'); [$('a', n).href, $('img', n).src, $('img', n).alt, $('div', n).innerHTML]
Array(4) [ "http://www.nicesparks.com/p/f/20211217_ThreeWiseMen.jpg", "http://www.nicesparks.com/p/t/20211217_ThreeWiseMen.jpg", "Three Wise Men", "Dec. 17th. Three Wise Men" ]
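
For anyone wanting the same scoped lookup outside the console, here’s a rough sketch of what the two helpers amount to, assuming they behave like thin wrappers around querySelector() and querySelectorAll(). The names first() and all() are made up for illustration; they are not the console’s own implementation.

```javascript
// Hypothetical stand-ins for the console's $() and $$() helpers.
// `root` is the node to search within; it defaults to the page's document.
function first(selector, root = globalThis.document) {
  return root.querySelector(selector); // first match, or null if nothing matches
}

function all(selector, root = globalThis.document) {
  // Copy the NodeList into a real array so array methods like map() are available.
  return Array.from(root.querySelectorAll(selector));
}
```

In a page, first('div', first('.gallery')).innerHTML should mirror the $('div', $('.gallery')) call from earlier.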

One thing to note: the last expression is the one that will be printed. In this case, it’s that array line, which I use to print all four values I’m after at once. Now we’re going to want to iterate over all of the elements:

> a = []; g = $$('.gallery'); for (var n of g) { }; a
Array []

You’ll notice I’ve left out the extraction step from the previous command, so the loop body is empty. This is a quick sanity check of my syntax before it gets gnarly. You always want to know that your syntax is sane before things get complicated.

> a = []; g = $$('.gallery'); for (var n of g) { a.push([$('a', n).href, $('img', n).src, $('img', n).alt, $('div', n).innerHTML]) }; a
Uncaught TypeError: can't access property "href", $(...) is null

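That… was not expected. Time for a quick peek. Since a holds every row pushed before the error, its length is also the index of the entry that failed: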

> a.length
120
> g[120].innerHTML
"
  ==================== VACATION! ==============
"

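Ahh. Okay, that explains that. That block is a separator with no link or image inside it, so $('a', n) came back null. We don’t really care about those, to be honest, so we’ll just swallow the errors. But we might want to see what else goes wrong, so we’ll collect the failures in f.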

> f = []; a = []; g = $$('.gallery'); for (var n of g) { try { a.push([$('a', n).href, $('img', n).src, $('img', n).alt, $('div', n).innerHTML]) } catch (e) { f.push(n.innerHTML) } }; a 
Array(409) [ (4) […], (4) […], (4) […], (4) […], (4) […], (4) […], (4) […], (4) […], (4) […], (4) […], … ]
> f.length
2
> f
Array [ "\n  ==================== VACATION! ==============\n", "\n  ==================== Old stuff ==============\n" ]
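
The try/catch above is the quick console way. If this logic ever graduated into a script, one alternative would be to check for the missing pieces explicitly, so that only blocks lacking a link, image, or description get skipped and unrelated errors still surface. A minimal sketch follows; extractRow() and scrape() are hypothetical names, and it assumes each gallery node supports querySelector().

```javascript
// Hypothetical helper: returns the four fields for one gallery block,
// or null when the block lacks the pieces we need (e.g. a text-only separator).
function extractRow(node) {
  const link = node.querySelector('a');
  const img = node.querySelector('img');
  const desc = node.querySelector('div');
  if (!link || !img || !desc) return null;
  return [link.href, img.src, img.alt, desc.innerHTML];
}

// Split a list of gallery blocks into extracted rows and skipped blocks.
function scrape(nodes) {
  const rows = [];
  const skipped = [];
  for (const node of nodes) {
    const row = extractRow(node);
    if (row) rows.push(row);
    else skipped.push(node.innerHTML); // keep the raw HTML so we can inspect it later
  }
  return { rows, skipped };
}
```

In the console this would be called as scrape($$('.gallery')), with rows playing the part of a and skipped the part of f.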

Okay, there weren’t too many of those informational blocks. But we now have our results in a! So let’s get that into something usable:

> copy(JSON.stringify(a, null, '\t'))
String was copied to clipboard.
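
To see what the extra JSON.stringify() arguments do in isolation, here’s a small sketch using a made-up sample row (the data is illustrative, not from the gallery):

```javascript
const sample = [["p/f/example.jpg", "Example"]]; // made-up sample row

const compact = JSON.stringify(sample);            // no whitespace at all
const tabbed = JSON.stringify(sample, null, "\t"); // one tab per nesting level
const spaced = JSON.stringify(sample, null, 2);    // a number means that many spaces

console.log(compact);
console.log(tabbed);
console.log(spaced);
```

All three strings parse back to the same data; only the whitespace differs.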

The copy() function is a Web Console helper (as are $() and $$()), not something available to ordinary page scripts. It copies a block of text to the clipboard. We use the extra arguments of JSON.stringify() to pretty-print the resulting JSON. The second argument is a replacer function for transforming values, which we don’t need, so it is set to null. The third argument is the indentation to use, which here is a tab. With our JSON in the clipboard we quickly pop a shell:

; xsel > gallery.json
; head gallery.json
[
    [
        "http://www.nicesparks.com/p/f/20211217_ThreeWiseMen.jpg",
        "http://www.nicesparks.com/p/t/20211217_ThreeWiseMen.jpg",
        "Three Wise Men",
        "Dec. 17th. Three Wise Men"
    ],
    [
        "http://www.nicesparks.com/p/f/20211217_ThreeMoreWiseMen.jpg",
        "http://www.nicesparks.com/p/t/20211217_ThreeMoreWiseMen.jpg",

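Awesome! The xsel command dumps an X selection to standard out, and we write it to “gallery.json”. By default xsel reads the primary selection, so if the file comes out empty or stale on your setup, xsel --clipboard asks for the clipboard explicitly. Then we check what we have with head, which prints the first ten lines of a file. A nice quick web-scraping!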
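
Since the original goal was a CSV file, the last hop from the array of rows to CSV text could look something like the sketch below. toCsv() is a hypothetical helper, and it assumes quoting every field is acceptable to whatever will read the file.

```javascript
// Hypothetical helper: turn an array of rows (arrays of strings) into CSV text.
function toCsv(rows) {
  return rows
    .map((row) =>
      row
        // Quote every field and double any embedded quotes, so commas,
        // quotes, and newlines inside a value cannot break the row.
        .map((field) => '"' + String(field).replace(/"/g, '""') + '"')
        .join(",")
    )
    .join("\r\n");
}
```

In the console, copy(toCsv(a)) would then put CSV on the clipboard in place of the JSON.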

FOSS Unleashed
