Web scraping is the process of extracting useful information from the World Wide Web. During a Google search, a web crawler (a bot) goes through nearly all the content on the web and picks out what's relevant to you.

The idea that information and knowledge should be accessible to everyone is what led to the formation of the World Wide Web. However, the data you are seeking must be permitted for public use.

How is web scraping useful?

We live in the age of data. With the help of web scraping, raw data can be converted into useful information that serves a bigger purpose. For example, it can be used to analyze and study the users of a product in order to improve the product, in other words, to create a feedback loop.

E-commerce companies may use it to study competitors' pricing strategies and work out their own accordingly. Web scraping can also be used for weather and news reporting.

Challenges

#1. IP Restriction

Many websites limit the number of requests you can make for their data within a certain time interval by detecting your IP address or geolocation. They do so to protect themselves from malicious attacks.

#2. Captcha

CAPTCHAs distinguish between a real human and a bot trying to access a website. Websites use them to prevent spam as well as to control the number of scrapers accessing the site.

#3. Client Side Rendering

This is one of the biggest obstacles for web scrapers. Modern websites use frontend frameworks capable of creating single-page applications. Most single-page applications do not serve server-rendered content; instead, they generate the content on demand using client-side JavaScript. This makes it difficult for scrapers to know what the content of a webpage actually is. To get the content, you need to render the client-side JavaScript first.
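
For example, if you request the HTML of a typical single-page application directly, without executing its JavaScript, all you get back is an empty shell. Here's a minimal sketch using Node's built-in fetch (Node 18+) and a hypothetical URL:

// fetching an SPA's HTML without executing its JavaScript returns only an empty container element
const res = await fetch('https://example-spa.dev') // hypothetical single-page application
console.log(await res.text())
// typically prints something like:
// <html><head>...</head><body><div id="root"></div><script src="/assets/index.js"></script></body></html>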

The Geekflare API

A web scraping API solves most of the challenges of web scraping because it handles everything for you. Let's explore the Geekflare API and see how you can use it for web scraping.

Geekflare’s API has a simple three-step process for you:

  • Provide a URL to scrape
  • Provide some configuration options
  • Get the data

It scrapes the webpage for you and returns the raw HTML either as a string or as an HTML file that can be accessed through a link, whichever works for you.

Using the API

In this tutorial, you will learn how to use the Geekflare API with Node.js, a JavaScript runtime environment. Install Node.js on your system, if you haven't already, before proceeding further.

  • Run the command npm init -y in the terminal inside your project folder or directory. It will create a package.json file for you.
  • Inside the package.json file, change the value of the main key to index.mjs if it's set to something else by default. Alternatively, you can add a key named type and set its value to module:
{
  "type": "module"
}
  • Add a dependency named axios by running the npm i axios command in the terminal. This dependency helps us make HTTP requests to specific endpoints.
  • Your package.json should look something like this:
{
  "name": "webscraping",
  "version": "1.0.0",
  "description": "",
  "main": "index.mjs",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "Murtuzaali Surti",
  "license": "ISC",
  "dependencies": {
    "axios": "^1.1.3"
  }
}
  • Import axios inside the index.mjs file like this. The import keyword is used here because it's an ES module. If it were a CommonJS file, you would use the require keyword instead.
import axios from 'axios'
  • The base URL for all requests to the Geekflare API is the same for every endpoint, so you can store it in a constant.
const baseUrl = 'https://api.geekflare.com'
  • Specify the URL that you want to scrape and get the data from.
let toScrapeURL = "https://developer.mozilla.org/en-US/"
  • Create an asynchronous function and initialize axios inside it.
async function getData() {
    const res = await axios({})
    return res
}
  • In the axios configuration options, specify the method as post, the URL along with the endpoint, a header named x-api-key whose value is the API key provided by Geekflare, and lastly, a data object that will be sent to the Geekflare API. You can get your API key by going to dash.geekflare.com.
const res = await axios({
    method: "post",
    url: `${baseUrl}/webscraping`,
    headers: {
        "x-api-key": "your api key"
    },
    data: {
        url: toScrapeURL,
        output: 'file',
        device: 'desktop',
        renderJS: true
    }
})
  • As you can see, the data object has the following properties:
    • url: the URL of a webpage that needs to be scraped.
    • output: the format in which the data is presented to you, either inline as a string or in an HTML file. Inline string is the default value.
    • device: the type of device on which you want the webpage to be opened. It accepts three values, 'desktop', 'mobile', and 'tablet', with 'desktop' being the default value.
    • renderJS: a boolean value specifying whether or not you want to render JavaScript. This option is useful when you are dealing with client-side rendering.
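  • For example, to open the webpage as a mobile device instead, you would only change the device property in the data object; a small sketch (the rest of the request stays the same):
data: {
    url: toScrapeURL,
    output: 'file',
    device: 'mobile', // scrape the mobile version of the webpage
    renderJS: true
}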
  • Call the asynchronous function and get the data. You can use an IIFE (Immediately Invoked Function Expression).
(async () => {
    const data = await getData()
    console.log(data.data)
})()
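  • Network requests can fail, so in practice you may want to wrap the call in a try...catch block. Here is a minimal sketch, reusing the getData function from above (axios throws an error for failed requests and non-2xx responses):
(async () => {
    try {
        const data = await getData()
        console.log(data.data)
    } catch (error) {
        // log the failure instead of letting it go unhandled
        console.error('Scraping request failed:', error.message)
    }
})()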
  • The response will be something like this:
{
  timestamp: 1669358356779,
  apiStatus: 'success',
  apiCode: 200,
  meta: {
    url: 'https://murtuzaalisurti.github.io',
    device: 'desktop',
    output: 'file',
    blockAds: true,
    renderJS: true,
    test: { id: 'mvan3sa30ajz5i8lu553tcckchkmqzr6' }
  },
  data: 'https://api-assets.geekflare.com/tests/web-scraping/pbn0v009vksiszv1cgz8o7tu.html'
}

Parsing HTML

To parse the HTML and extract data from it, you can use an npm package named node-html-parser. For example, if you want to extract the title of a webpage, you can do this:

import { parse } from 'node-html-parser'

const html = parse(htmlData) // htmlData is the raw HTML string you get from the Geekflare API
const title = html.querySelector('title').text // extract the contents of the <title> tag
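
Note that when output is set to 'file', as in the earlier request, the data field of the response is a link to the generated HTML file rather than the HTML itself, so you need to download that file before parsing it. Here's a minimal sketch, assuming the getData function and the axios import from earlier:

// fetch the generated HTML file returned by the Geekflare API before parsing it
const res = await getData()                         // axios response from the API
const fileUrl = res.data.data                       // link to the scraped HTML file
const { data: htmlData } = await axios.get(fileUrl) // htmlData is now the raw HTML string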

Alternatively, if you only want the metadata from a website, you can use Geekflare’s metadata API endpoint. You don’t even have to parse HTML.

Benefits of using Geekflare API

In single-page applications, the content is often not server-rendered; instead, it is rendered by the browser using JavaScript. So, if you scrape the original URL without rendering the JavaScript needed to generate the content, you get nothing but a container element with no content in it. Let me show you an example.

Here's a demo website built using React and Vite. Scrape this site using the Geekflare API with the renderJS option set to false, as sketched below. What do you get?
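
Here's a minimal sketch of that request, reusing the configuration from earlier; the demo site URL below is just a placeholder for the site linked above:

const res = await axios({
    method: "post",
    url: `${baseUrl}/webscraping`,
    headers: {
        "x-api-key": "your api key"
    },
    data: {
        url: "https://your-demo-site.example", // placeholder for the demo site
        output: 'file',
        device: 'desktop',
        renderJS: false // do not execute the client-side JavaScript
    }
})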

There's just a root container with no content inside. This is where the renderJS option comes into play. Now try scraping the same site with renderJS set to true. What do you get?

This time, you get the actual rendered content of the page:
Vite React

Edit src/App.jsx and save to test HMR

Click on the Vite and React logos to learn more

Another benefit of using the Geekflare API is that it lets you use a rotating proxy, which ensures the target website won't block your IP. The Geekflare API includes the proxy feature under its premium plan.

Final Words

Using a web scraping API lets you focus on the scraped data rather than the technical hassle. Apart from that, the Geekflare API also provides features such as broken link checking, meta scraping, website load statistics, screenshot capturing, site status, and much more, all under a single API. Check out the official documentation of the Geekflare API for more information.