aneta_bielska [:blog]

Web scraping with Ruby
#ruby #web-scraping

By writing a couple of lines of code you can pull the data you need from a website that does not provide an API. To make it happen we need an HTML parser. I am going to use Nokogiri, but there are other options; check the Ruby Toolbox for alternatives.

Example

Add nokogiri to Gemfile:

gem 'nokogiri'

Then require it in your file, together with open-uri:

require 'nokogiri'
require 'open-uri'

Now you are ready to fetch the content of the page:

# On Ruby 3.0+ use URI.open; a bare open() no longer accepts URLs
page = Nokogiri::HTML(URI.open("http://www.example.com"))

Selectors

Let’s get some content:

page.css('a')[0] # find first a tag element

# => returns a Nokogiri element, which is displayed like this:

#(Element:0x1480d7c {
  name = "a", attributes = [
    #(Attr:0x1480354 {
      name = "href",
      value = "http://www.iana.org/domains/example"
    })
  ],
  children = [ #(Text "More information...")]
})

# This is equivalent to
# <a href="http://www.iana.org/domains/example">More information...</a>

Get more details:

page.css('a')[0].text

# => returns the inner text of the 'a' element

"More information..."

or

page.css('a')[0]['href']

# => returns the value of the href attribute

"http://www.iana.org/domains/example"

More selectors

page.css('title')                 # The <title> element

page.css("title")[0].text         # HTML document title value

page.css('li')                    # All <li> elements

page.css('li a')[1]['href']       # The URL of the second link found inside an <li>

page.css("li[data-attr='value']") # The <li> elements whose data-attr equals "value"

page.css('div#foo')[0]            # The <div> element with an id of "foo"

page.css('div#foo a')             # The <a> elements nested inside the <div>
                                  # element that has an id of "foo"

Of course you can use Ruby methods on top of it:

page.css("a").length                            # number of <a> elements

page.css("a").each { |link| puts link['href'] } # print the href of every link



For more information check the gem docs and this great tutorial. Also check my app repository to see nokogiri in action.
