Building a data capture application using Node.js and Puppeteer

Puppeteer is a Node.js library, available on GitHub, that provides a high-level API to control Chrome/Chromium over the DevTools Protocol.

I have found it extremely useful for gathering student mark data from web-based learning management systems (LMS). The project requirements were to get student data for an external reporting system. The normal way to do this would be to access the backend database. One of the problems with pulling data from the database is calculating the raw marks in the same way that the web application does. The programmers were able to come close to replicating this, but on occasion minor discrepancies crept in, leading to conflicting scores. The application vendor was also unwilling to share the algorithms used to produce the scores, so a more accurate method had to be found. Why not just pull the exact marks directly from the web application and do away with recalculating the raw marks altogether? Enter Puppeteer.

Used with Node.js, Puppeteer becomes a powerful screen-scraping tool. Within the application we developed a function to log in. Once logged in, we looped through a list of URLs and then looped through the repeating elements on each page.
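The login step and the URL loop described above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the selectors (#username, #password, #login-button), the lms.example.com URLs and the LMS_USER / LMS_PASS environment variables are all assumptions made up for the example, and a real LMS will need its own values.

```javascript
// Sketch only: selectors, URLs and environment variable names are hypothetical.

// Log in by filling the form and waiting for the post-login navigation.
async function login(page, { loginUrl, username, password }) {
  await page.goto(loginUrl, { waitUntil: 'networkidle2' });
  await page.type('#username', username); // hypothetical selector
  await page.type('#password', password); // hypothetical selector
  // Start waiting for the navigation before clicking, so it is not missed.
  await Promise.all([
    page.waitForNavigation({ waitUntil: 'networkidle2' }),
    page.click('#login-button'), // hypothetical selector
  ]);
}

// Visit each URL in turn and collect whatever `extract` returns for that page.
async function visitAll(page, urls, extract) {
  const results = [];
  for (const url of urls) {
    await page.goto(url, { waitUntil: 'networkidle2' });
    results.push({ url, data: await extract(page) });
  }
  return results;
}

async function main() {
  // Loaded here so the helpers above can be reused without a browser installed.
  const puppeteer = require('puppeteer'); // requires `npm install puppeteer`
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await login(page, {
      loginUrl: 'https://lms.example.com/login', // hypothetical URL
      username: process.env.LMS_USER,
      password: process.env.LMS_PASS,
    });
    const pages = await visitAll(
      page,
      ['https://lms.example.com/course/1/marks'], // hypothetical URL
      (p) => p.title()
    );
    console.log(pages);
  } finally {
    await browser.close();
  }
}

// Only touch a real site when credentials have been supplied.
if (process.env.LMS_USER && process.env.LMS_PASS) {
  main().catch((err) => {
    console.error(err);
    process.exitCode = 1;
  });
}
```

Passing the page object into login and visitAll keeps the browser session (and its cookies) shared across every URL, so the login only has to happen once.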

[Screenshot: how to log in using Puppeteer]

[Screenshot: looping through elements and returning the inner text]

The screenshots show how to log in to a web application. Once logged in, you can navigate to various parts of the site and pull out HTML elements. Using the DOM methods querySelectorAll, querySelector and getAttribute within the page, Puppeteer gives access to inner text and attribute values. This is really quite powerful, and a fast alternative to running queries against the database.
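As a rough sketch of the element loop, Puppeteer's page.$$eval runs querySelectorAll in the page and hands the matching elements to a callback, inside which querySelector, getAttribute and innerText are available. The row selector, the .name and .mark class names and the data-student-id attribute below are assumptions invented for the example; the real markup of an LMS will differ.

```javascript
// Sketch only: the selectors and attribute name are hypothetical.
// The function itself runs in Node; the callback given to $$eval runs in the page.
async function extractMarks(page, rowSelector = 'table.marks tr.student') {
  return page.$$eval(rowSelector, (rows) =>
    rows.map((row) => {
      const nameCell = row.querySelector('.name'); // hypothetical class
      const markCell = row.querySelector('.mark'); // hypothetical class
      return {
        // Attribute value read straight from the row element.
        studentId: row.getAttribute('data-student-id'),
        // Inner text exactly as the web application rendered it.
        name: nameCell ? nameCell.innerText.trim() : null,
        mark: markCell ? markCell.innerText.trim() : null,
      };
    })
  );
}
```

With an open, logged-in page, `const marks = await extractMarks(page);` would return one plain object per matching row, ready to be written out for the reporting system. Missing cells come back as null rather than throwing, which helps when some rows have no mark yet.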

This is a simple example of how this interesting library can be used. It took fewer than 500 lines of code to achieve this, and it could probably be done even more efficiently.

Hope this helps somebody out there!

Regards

challengerX