Web scraping

What is web scraping?

Web scraping[3] (web harvesting or web data extraction) is a computer software technique of extracting information from websites. Usually, such software programs simulate human exploration of the World Wide Web by either implementing low-level Hypertext Transfer Protocol (HTTP), or embedding a fully-fledged web browser, such as Mozilla Firefox.

Web scraping is closely related to web indexing, which indexes information on the web using a bot or web crawler and is a universal technique adopted by most search engines. In contrast, web scraping focuses more on the transformation of unstructured data on the web, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet. Web scraping is also related to web automation, which simulates human browsing using computer software. Uses of web scraping include online price comparison, contact scraping, weather data monitoring, website change detection, research, web mashup and web data integration.
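
The transformation of unstructured HTML into structured data described above can be sketched with nothing but Python's standard library. This is an illustrative sketch, not the method used later in this document (which uses BeautifulSoup); the HTML snippet and field names are invented for the example.

```python
try:
    from HTMLParser import HTMLParser   # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3

# Unstructured HTML as it might appear on a listing page (invented example).
html = '<ul><li><a href="/item/1">Widget</a></li><li><a href="/item/2">Gadget</a></li></ul>'

class LinkCollector(HTMLParser):
    """Collects each <a> tag as a structured {'name': ..., 'url': ...} record."""
    def __init__(self):
        HTMLParser.__init__(self)  # old-style call so it runs on Python 2 and 3
        self.rows = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')

    def handle_data(self, data):
        if self._href is not None:
            self.rows.append({'name': data, 'url': self._href})
            self._href = None

parser = LinkCollector()
parser.feed(html)
print(parser.rows)  # [{'name': 'Widget', 'url': '/item/1'}, {'name': 'Gadget', 'url': '/item/2'}]
```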

Web scraping is a technique that extracts only the information a user specifies or needs from the various information displayed in a web browser, then processes, stores, and presents it to the user.[1]

Python Mechanize is a module that provides an API for programmatically browsing web pages and manipulating HTML forms. BeautifulSoup is a library for parsing and extracting data from HTML. Together they form a powerful combination of tools for web scraping. -- Source: [2]

Related Python libraries

Mechanize

A very useful Python module for navigating through web forms. mechanize.Browser implements the urllib2.OpenerDirector interface. Browser objects have state, including navigation history, HTML form state, cookies, etc.[4]

BeautifulSoup

Beautiful Soup is a Python library designed for quick turnaround projects like screen-scraping. It makes browsing the DOM a breeze with all its utility methods.[4]
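
As a rough illustration of what a call like soup.find_all('h3', class_="newaps") does, here is a standard-library sketch that filters elements by tag name and CSS class. It only mimics the idea; BeautifulSoup's real implementation is far more capable.

```python
try:
    from HTMLParser import HTMLParser   # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3

class ClassFilter(HTMLParser):
    """Collects the text of every `tag` element carrying CSS class `cls`,
    loosely mimicking soup.find_all(tag, class_=cls)."""
    def __init__(self, tag, cls):
        HTMLParser.__init__(self)  # old-style call so it runs on Python 2 and 3
        self.tag, self.cls = tag, cls
        self.hits = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if tag == self.tag and self.cls in (dict(attrs).get('class') or '').split():
            self._inside = True

    def handle_endtag(self, tag):
        if tag == self.tag:
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.hits.append(data)

f = ClassFilter('h3', 'newaps')
f.feed('<h3 class="newaps">BlackBerry Z3</h3><h3>other</h3>')
print(f.hits)  # ['BlackBerry Z3']
```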

Development environment setup

Install the libraries with pip:

pip install mechanize
pip install beautifulsoup4

Example 1[5]: Searching for 'Blackberry' on the Amazon site

Source code

from mechanize import Browser
from bs4 import BeautifulSoup as BS

br = Browser()

# Browser options
# Ignore robots.txt. Do not do this without thought and consideration.
br.set_handle_robots(False)

# Don't add Referer (sic) header
br.set_handle_referer(False)

# Don't handle Refresh redirections
br.set_handle_refresh(False)

#Setting the user agent as firefox
br.addheaders = [('User-agent', 'Firefox')]

br.open('http://www.amazon.in/')
br.select_form(name="site-search")
br['field-keywords'] = "Blackberry"
br.submit()

#Getting the response in beautifulsoup
soup = BS(br.response().read(), "html.parser")

for product in soup.find_all('h3', class_="newaps"):
    #printing product name and url
    print "Product Name : " + product.a.text
    print "Product Url : " + product.a["href"]
    print "======================="

Execution result

Product Name : BlackBerry Z3 (Black, 8GB)
Product Url : http://www.amazon.in/BlackBerry-Z3-Black-8GB/dp/B00LFL8SUC/ref=sr_1_1?s=electronics&ie=UTF8&qid=1444874370&sr=1-1&keywords=Blackberry
=======================
Product Name : Apple iPhone 5s (Gold, 16GB)
Product Url : http://www.amazon.in/Apple-iPhone-5s-Gold-16GB/dp/B00FXLCD38/ref=sr_1_2?s=electronics&ie=UTF8&qid=1444874370&sr=1-2&keywords=Blackberry
=======================
...

Detailed analysis (5 steps)

Step 1: Get the mechanize browser instance and set the browser options.

br = Browser()

# Ignore robots.txt.  Do not do this without thought and consideration.
br.set_handle_robots(False)

# Don't add Referer (sic) header
br.set_handle_referer(False)

# Don't handle Refresh redirections
br.set_handle_refresh(False)

#Setting the user agent as firefox
br.addheaders = [('User-agent', 'Firefox')]

Step 2: Select the search form by its name. (The form name was found by visiting the amazon.in site and inspecting the form with the Chrome developer tools.)

br.open('http://www.amazon.in/')
br.select_form(name="site-search")

Step 3: Set the value of the search field and submit the form.

br['field-keywords'] = "Blackberry"
br.submit()
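
What br.submit() does under the hood is send an ordinary HTTP request carrying the form's field values. For a GET form this amounts to appending a query string to the action URL, which can be built by hand; the /s path below is an assumption for illustration, not necessarily Amazon's real search endpoint.

```python
try:
    from urllib import urlencode        # Python 2
except ImportError:
    from urllib.parse import urlencode  # Python 3

# Hypothetical: assume the search form submits via GET to /s.
base = 'http://www.amazon.in/s'
url = base + '?' + urlencode({'field-keywords': 'Blackberry'})
print(url)  # http://www.amazon.in/s?field-keywords=Blackberry
```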

Step 4: Read the response into BeautifulSoup.

soup = BS(br.response().read(), "html.parser")

Step 5: Traverse the response DOM to display each product name and URL.

for product in soup.find_all('h3', class_="newaps"):
    #printing product name and url
    print "Product Name : " + product.a.text
    print "Product Url : " + product.a["href"]
    print "======================="

Example 2: Fetching reservation-rate information from the Naver Movie site

Source code

#!/usr/bin/env python
# -*- coding: utf-8 -*- 

import sys
reload(sys)                      # Python 2 only: re-expose setdefaultencoding
sys.setdefaultencoding("utf-8")  # so Korean strings print without UnicodeDecodeError

def main():
    from mechanize import Browser
    from bs4 import BeautifulSoup as BS

    br = Browser()

    # Browser options
    # Ignore robots.txt. Do not do this without thought and consideration.
    br.set_handle_robots(False)

    # Don't add Referer (sic) header
    br.set_handle_referer(False)

    # Don't handle Refresh redirections
    br.set_handle_refresh(False)

    #Setting the user agent as firefox
    br.addheaders = [('User-agent', 'Firefox')]

    br.open('http://movie.naver.com/movie/sdb/rank/rreserve.nhn')
    #     br.select_form(name="site-search")
    #     br['field-keywords'] = "Blackberry"
    #     br.submit()

    #Getting the response in beautifulsoup
    soup = BS(br.response().read(), "html.parser")

    # temporary output
    #print soup.original_encoding
    #print soup.prettify()

    rank = soup.find_all('div', class_="tit4")
    reservation = soup.find_all('td', class_="reserve_per ar")
    print '예매 랭킹 TOP 10'
    print '순위 | 영화명 | 예매율 | 주소'
    for i in xrange(0, 10):
        #print rank[i]
        #print reservation[i]
        print '%4d' % (i), '|', rank[i].a.string.strip(), '|', reservation[i].string, '|', 'http://movie.naver.com' + rank[i].a['href']

if __name__ == "__main__":
    print('...웹 스크래핑 시작...')
    main()
    print('...웹 스크래핑 종료...')

Execution result

...웹 스크래핑 시작...
예매 랭킹 TOP 10
순위 | 영화명 | 예매율 | 주소
   0 | 마션 | 26.45% | http://movie.naver.com/movie/bi/mi/basic.nhn?code=129049
   1 | 성난 변호사 | 20.52% | http://movie.naver.com/movie/bi/mi/basic.nhn?code=129051
   2 | 인턴 | 12.23% | http://movie.naver.com/movie/bi/mi/basic.nhn?code=118917
   3 | 탐정 : 더비기닝 | 7.68% | http://movie.naver.com/movie/bi/mi/basic.nhn?code=124201
   4 | 사도 | 6.29% | http://movie.naver.com/movie/bi/mi/basic.nhn?code=121922
   5 | 트랜스포터: 리퓰드 | 5.3% | http://movie.naver.com/movie/bi/mi/basic.nhn?code=130978
   6 | 라이프 | 2.12% | http://movie.naver.com/movie/bi/mi/basic.nhn?code=122494
   7 | 주온: 더 파이널 | 1.99% | http://movie.naver.com/movie/bi/mi/basic.nhn?code=140015
   8 | 지금은맞고그때는틀리다 | 1.67% | http://movie.naver.com/movie/bi/mi/basic.nhn?code=140199
   9 | 리그레션 | 1.54% | http://movie.naver.com/movie/bi/mi/basic.nhn?code=114278
...웹 스크래핑 종료...
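
The reservation rates scraped above arrive as strings such as '26.45%'. A small helper (hypothetical, not part of the example code) converts them to numbers so they can be compared or summed:

```python
def parse_rate(text):
    """Convert a scraped rate string such as '26.45%' to a float."""
    return float(text.strip().rstrip('%'))

rates = ['26.45%', '20.52%', '5.3%']
values = [parse_rate(r) for r in rates]
print(values)       # [26.45, 20.52, 5.3]
print(max(values))  # 26.45
```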

Further learning materials

References

[1] http://www.modutech.co.kr/html_labs/research_view.html?category=2&no=123&menu_no=92&detail_menu=37

[2] http://toddhayton.com/2014/12/08/form-handling-with-mechanize-and-beautifulsoup/

[3] https://en.wikipedia.org/wiki/Web_scraping

[4] https://pythondevs.wordpress.com/2014/04/08/web-scrapping-using-mechanize-and-beautifulsoup

[5] https://pythondevs.wordpress.com/2014/04/08/web-scrapping-using-mechanize-and-beautifulsoup/