Web Scraping with Node and Puppeteer

22 November, 2020

In this post, we'll be making our first small little web scraping app.

Before we get started, let's just talk a little bit about web scraping and what it is. The most simlified definition for web scraping is "extracting data from websites", which is somewhat implied by the name. It has always been very much of a grey area. Going into a legal discussion is beyond the scope of this article, though I will recommend this blog post going into deeper detail about that.

So, to introduce today's project, we'll be building a simple GitHub follower counter, to count how many followers a user has on GitHub through the terminal.

#

Initialising

First, let's make a directory for this repository.

terminal
mkdir github-follower-counter

cd github-follower-counter

Open it in your code editor. If you're using Visual Studio Code you can simply do code .

Initialise with Yarn (or NPM).

terminal
yarn init -y

# For NPM
# npm init -y

Install puppeteer:

terminal
yarn add puppeteer

# For NPM
# npm i puppeteer

#

Getting started with the code

First off, let's import puppeteer into our project.

index.js
const puppeteer = require('puppeteer')

Now, let's get the terminal arguments from the user. To do this, we can use process.argv:

index.js
let username = process.argv[2]

if (username == null) return console.log('Error! Please specify a user!')

Next, let's create our getFollowers function.

index.js
const getFollowers = async (user = `https://github.com/${username}`) => {}

Inside it, let's launch the browser, open a new tab, and navigate to the URL.

index.js
let browser = await puppeteer.launch()
let page = await browser.newPage()
await page.goto(user)

Inside it, let's evaluate the page.

index.js
let githubFollowers = await page.evaluate(() => {})

Now, let's get the follower count. If we navigate over to GitHub, and right click < view page source (or ctrl+u). We can see the code of the website.

Inside of here, we can see that the span element, with the class of text-bold text-gray-dark has the current follower count.

image

Back to our code, let's do:

index.js
const followerCount = document.querySelector('span.text-bold').innerHTML

Now, let's output the results. There is an error however. If a user does not exist, then it will show us as "optional" on the follower count. To prevent this, we can add:

index.js
if (followerCount == 'optional')
  return 'Error! Incorrect username, make sure to double check your spelling.'
else return `That user has a total of ${followerCount} followers!`

Next, back to our function, let's output this.

index.js
let githubFollowers = await page.evaluate(() => {
   const followerCount = document.querySelector('span.text-bold').innerHTML

   if (followerCount == 'optional') return('Error! Incorrect username, make sure to double check your spelling.')
   else return(`That user has a total of ${followerCount} followers!`)
})

console.log(githubFollowers)

Make sure to close the browser window as well.

index.js
await browser.close()

At the bottom, don't forget to call this function.

index.js
getFollowers()

...and you should be good to go! Make sure to run node index.js followed by a username to test it out.

_Note: a far better way to do this is to use the GitHub API. This was primarily a way on how to select and get certain elements, if you're looking to make an actual project with this, then the GitHub API is the way to go!

Thanks for reading, and Happy Thanksgiving. 👋