The Great Blog

Create Serverless API for HTML Scraping

September 08, 2019

Last week I wrote a post about how to set up a serverless function with Zeit Now. I want to continue the topic and create an API that extracts data, transforms it, and loads it back to the user. As an example, I picked the showrss.info website, which does not have an API but responds with an HTML document.

Selecting Tools

Tools we will use:

  • Now CLI
  • axios
  • cheerio

First of all, we need an HTTP client for Node.js to request the website. I am very fond of axios, which works both in browsers and in Node.js. cheerio is an implementation of core jQuery features for the server: it parses markup and provides an API for traversing and manipulating the resulting data structure. We will use it to scrape the HTML.

Also, if you want to follow the steps, install Now CLI globally with npm i -g now.

Setup Project

Let’s start from scratch and initialize a project in the terminal:

mkdir showrss && cd showrss
npm init --yes
npm install axios cheerio

If you wonder what npm init --yes means, it skips the questionnaire and accepts the default answer for every prompt. Great, we have a new project with just package.json and node_modules. The next step is to create an API endpoint for users to send requests to. For Zeit Now to create a serverless function with an endpoint, the project has to have a folder named api in the root directory, with a file inside whose name becomes the endpoint. Again in the terminal, run:

mkdir api && cd api
touch shows.js

Create a Serverless Function

The file shows.js should export a default function, which receives two arguments, request and response. These are the standard HTTP request and response objects, enhanced with some helpers by Zeit Now.

// shows.js
module.exports = (request, response) => {
  // send() method can receive a string, object or buffer
  // json() will send only JSON object
  response.send('Hello there!');
}

To test the endpoint, start the function in dev mode from the project root directory with now dev. If you open http://localhost:3000/api/shows in the browser, you should get a greeting message.
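The helpers mentioned above can be combined in one handler. Below is a minimal sketch, assuming the status(), json(), and send() helpers and the parsed request.query object that Zeit Now adds to the standard objects. The name query parameter and the greeting payload are made up for illustration.

```javascript
// api/shows.js (illustrative variant of the greeting handler)
const handler = (request, response) => {
  // request.query holds the parsed query string, e.g. /api/shows?name=Linas
  const { name } = request.query || {};

  if (!name) {
    // send() accepts a string, object or buffer
    response.status(200).send('Hello there!');
    return;
  }

  // json() serializes the object and sends it as JSON
  response.status(200).json({ greeting: `Hello, ${name}!` });
};

module.exports = handler;
```

Calling /api/shows returns the plain greeting, while /api/shows?name=Linas returns a JSON object instead.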

Extract Data from Third-Party Service

When the serverless function is called, we can request the third-party service showrss.info from inside the function with axios. A couple of points to be aware of:

  • The HTTP call will take some time, so we need to declare the default function as async and await the result.
  • axios returns an object with a data field, where the actual response body is stored.

const axios = require('axios');

module.exports = async (request, response) => {
  const showsResponse = await axios.get('https://showrss.info/browse');
  const htmlData = showsResponse.data;
  
  response.send(htmlData);
};

By now you should see a rendered HTML response.
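The third-party request can fail or time out: axios rejects the promise on network errors and, by default, on non-2xx status codes. A minimal sketch of one way to guard the call follows. To keep the snippet runnable without installing anything, the HTTP call is passed in as a function; makeShowsHandler and httpGet are hypothetical names, and the 502 status is just one reasonable choice.

```javascript
// Wraps any promise-returning GET function (such as a thin wrapper around axios.get)
const makeShowsHandler = (httpGet) => async (request, response) => {
  try {
    const showsResponse = await httpGet('https://showrss.info/browse');
    response.status(200).send(showsResponse.data);
  } catch (error) {
    // The upstream site did not answer as expected
    response.status(502).json({ error: 'Failed to fetch showrss.info' });
  }
};

// In api/shows.js this could be wired up as:
// const axios = require('axios');
// module.exports = makeShowsHandler((url) => axios.get(url));
```

With this in place, the client gets a clear error response instead of a crashed function when the website is unreachable.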

Transform HTML into JS Object

The response from the website is an HTML document. To traverse it and extract the information for each show (id, title, and the individual RSS feed link), load the document into cheerio with the load() method. After that, the data is ready for extraction and accessible the same way as with jQuery.

const cheerio = require("cheerio");

  // ...
  const $ = cheerio.load(htmlData);
  const options = $("#showselector option");
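  // toArray() yields only the matched <option> elements;
  // Object.keys() would also pick up non-element properties such as length
  const showList = options.toArray()
    .map(option => {
      const show = $(option);
      return {
        id: show.attr("value"),
        title: show.text(),
        rss: `http://showrss.info/show/${show.attr("value")}.rss`
      };
    });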

  response.send(showList);
  // ...

Try to access the endpoint now, and you should receive an array with the list of all TV shows, each with an individual RSS feed link.
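The mapping itself does not depend on cheerio, so it can live in a small plain function that is easy to test. This is only a sketch: toShowList is a hypothetical helper, it expects the value and text of each option to be read beforehand, and skipping entries with an empty value assumes the select may contain a placeholder option, which may not be the case.

```javascript
// Turns raw { value, text } pairs into the objects returned by the API
const toShowList = (rawOptions) =>
  rawOptions
    // assumption: a placeholder option would have an empty value
    .filter(({ value }) => Boolean(value))
    .map(({ value, text }) => ({
      id: value,
      title: text.trim(),
      rss: `http://showrss.info/show/${value}.rss`
    }));

// With cheerio, the pairs could be collected like this:
// const rawOptions = options.toArray().map(el => ({
//   value: $(el).attr('value'),
//   text: $(el).text()
// }));
```

Keeping the transformation separate from the scraping means it can be checked with hand-written input, without calling the website.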

Deploy and Enjoy

The last step is to deploy the serverless function by running now in the terminal. After a successful deployment, you’ll get an access link, and on the client you can fetch the already transformed list. I challenge you to make a new serverless function that receives a show id as an argument and responds with additional data about that show.
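As a possible starting point for the challenge, here is a sketch of a second function that reads the show id from the query string. The file name api/feed.js, the id parameter, and the numeric-id check are assumptions of mine; what additional data to return is left open.

```javascript
// api/feed.js (hypothetical file name), called as /api/feed?id=123
const handler = (request, response) => {
  const { id } = request.query || {};

  // assumption: show ids are numeric
  if (!/^\d+$/.test(String(id))) {
    response.status(400).json({ error: 'Query parameter "id" must be a number' });
    return;
  }

  response.status(200).json({
    id,
    rss: `http://showrss.info/show/${id}.rss`
    // additional data for the show would be added here
  });
};

module.exports = handler;
```

Validating the argument first keeps malformed input from ever reaching the request to the third-party site.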


Linas Spukas

Hi there! My name is Linas Spukas, I am a full stack web developer and this is my blog. About stuff and things... in development. Enjoy.