Steps for web scraping with Ruby on Rails


What is Web Scraping?

Web scraping is a technique for extracting large amounts of data from a website and displaying it or storing it in a file for further use.

Companies often need to extract volumes of data from a particular site to process and analyze, and web scraping with Ruby on Rails is a straightforward way to do that. It can crawl and extract the required data from both static websites and JavaScript-rendered websites.

You can also use a web scraping API tool to integrate data harvesting directly into your Ruby on Rails project. The right choice comes down to the type of scraping you want to carry out, the extent of your coding skills, and how much of the process you want to automate.

When talking about web scraping using Ruby, there are a few tools to get familiar with: Nokogiri, Capybara, and Kimurai. Of these, Kimurai is the most full-featured framework for scraping data.

Kimurai

Kimurai is a web scraping framework in Ruby that works out of the box with Headless Chromium/Firefox, PhantomJS, or simple HTTP requests, and allows us to scrape and interact with JavaScript-rendered websites. It also works alongside any Ruby web scraping library for better functionality.

Features:

  • Scrapes static websites as well as JavaScript-rendered websites, including infinite-scroll pages.
  • Supports multiple engines: headless browsers for JavaScript-heavy sites, or plain HTTP requests for simple ones.

Amazing, right?

Read Also: Web scraping using Mechanize in Ruby on Rails

Static Websites:

You can use this framework in 2 ways:

  1. Building a Rails app and extracting information with the help of models and controllers.
    • Create a new Rails app:

rails _5.2.3_ new web_scrapping_demo --database=postgresql

rails db:create

bundle install

rails g model WebScrapper --parent Kimurai::Base

rails db:migrate

rails g controller WebScrappersController index

Add the routes in config/routes.rb:

root 'web_scrappers#new'

resources :web_scrappers

Add a link in the view to trigger the scraper:

<%= link_to 'Start Scrap', new_web_scrapper_path %>

def new
  WebScrapper.crawl!
end

Note: Here, crawl! performs the full run of the spider. The parse method is very important and should be present in every spider: it is the entry point of any spider.

Here,

@name = name of the spider/web scraper

@engine = specifies the supported engine

@start_urls = array of start URLs, processed one by one inside the parse method.

@config = optional; provides various custom configurations such as user_agent, delay, etc.

Read the Case Study about – Web Scraping RPA (Data Extraction)

Note: You can use several supported engines here. If we use mechanize, no configuration or installation is involved; it works for simple HTTP requests but does not execute JavaScript. Other engines such as selenium_chrome, poltergeist_phantomjs, and selenium_firefox are all JavaScript-capable and run in headless mode.

Here, in the parse method,

response = Nokogiri::HTML::Document object for the requested website.

url = string URL of the processed web page.

data = storage used to pass data between requests.

The data to be fetched from a website is selected using XPath and then structured as required.
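As a rough illustration of the XPath selection step, here is the same idea shown with Ruby's bundled REXML library on a small hand-written fragment. Inside a spider you would run the equivalent xpath query on the Nokogiri response object instead:

```ruby
require 'rexml/document'

# Tiny well-formed page fragment standing in for a scraped response.
html = <<~XML
  <html>
    <body>
      <h2 class="title">First article</h2>
      <h2 class="title">Second article</h2>
    </body>
  </html>
XML

doc = REXML::Document.new(html)

# Select every <h2> node via XPath and structure the result as an array.
titles = REXML::XPath.match(doc, "//h2").map(&:text)
puts titles.inspect  # => ["First article", "Second article"]
```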

rails s

Hooray! You have extracted information from a static website.

  2. Using a standalone Ruby file, without a Rails app.
    • Open the terminal and install kimurai using the below-mentioned command:

gem install kimurai

    • Put the spider class in a single Ruby file and run it:

ruby filename.rb

Dynamic Websites / JS rendered websites:

Pre-requisites:

Install browsers with web drivers:

For Ubuntu 18.04:

$ kimurai setup localhost --local --ask-sudo

Note: The setup works using Ansible. If Ansible is not installed, install it using:

$ sudo apt install ansible

sudo apt install -q -y unzip wget tar openssl
sudo apt install -q -y xvfb

As with static websites, you can use the framework either inside a Rails app or as a standalone Ruby file:

gem install kimurai

ruby filename.rb

You can find the whole source code here.

Visit BoTree Technologies for excellent Ruby on Rails web development services and hire Ruby on Rails web developers with experience in handling marketplace development projects.

Reach out to learn more about New York web development agencies and the various ways to improve the quality of projects across your company.

Consulting is free – let us help you grow!
