Scrapyをインストールしてバックグラウンドでスクレイピングするまで

1. Python2.7とpipのインストール

まずはPythonをソースからインストールします。

$ yum install zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel
$ cd /usr/local/src
$ wget https://www.python.org/ftp/python/2.7.11/Python-2.7.11.tgz
$ tar xvfz Python-2.7.6.tgz
$ cd Python-2.7.6
$ ./configure --prefix=/usr/local
$ make
$ make altinstall

次にpipのインストールです。easy_installでpipを入れることもできますが、今では下記のコマンドで一発でpipを入れることができます。

$ curl -kL https://raw.github.com/pypa/pip/master/contrib/get-pip.py | python

また、easy_installを使ってpipをインストールする方法は以下の通りです。easy_installを使うのにsetuptoolsではなくdistributeを利用しています。

$ wget http://pypi.python.org/packages/source/d/distribute/distribute-0.6.27.tar.gz
$ tar zxvf distribute-0.6.27.tar.gz
$ cd distribute-0.6.27
$ sudo python2.7 setup.py install
$ easy_install-2.7 pip

2. Scrapyのインストール

次にpipを使ってscrapyをインストールします。

$ pip install scrapy
    Compile failed: command 'gcc' failed with exit status 1
    creating tmp
    cc -I/usr/include/libxml2 -c /tmp/xmlXPathInitacjqEE.c -o tmp/xmlXPathInitacjqEE.o
    /tmp/xmlXPathInitacjqEE.c:1:26: error: libxml/xpath.h: そのようなファイルやディレクトリはありません
    *********************************************************************************
    Could not find function xmlCheckVersion in library libxml2. Is libxml2 installed?
    *********************************************************************************
    error: command 'gcc' failed with exit status 1

エラーですね。依存関係をクリアできなかったみたいです。libxml/xpath.hがないとのことなのでインストールします。

$ yum -y libxslt-devel

これで再度インストールすればうまくいきました。

$ pip install scrapy

3. Scrapyを使ってみる

Scrapyを使う簡単な流れは以下のようになります。

Scrapy プロジェクトの作成
基本設定と抽出するアイテムの定義
スクレイピング＆クロール用のSpider作成
抽出したアイテムのパイプライン処理作成

3-1. Scrapy プロジェクトの作成

scrapyコマンドを使うことでプロジェクトを作成できます。プロジェクト名は英数字またはアンダースコアしか使えないので注意です。

$scrapy startproject sample_crawler

すると以下のようなディレクトリが作成されます。

sample_crawler/
├ sample_crawler
│ ├ __init__.py
│ ├ items.py
│ ├ pipelines.py
│ ├ settings.py
│ └ spiders
│    └ __init__.py
└── scrapy.cfg

ここで主に編集するのは、以下のファイルです。

items.py      データ保存のためのクラス
settings.py   基本的な設定
pipelines.py  抽出データの出力先を指定
spiders/      spiderのディレクトリ

3-2. 基本設定と抽出するアイテムの定義

settings.pyでクローラーの設定をすることができます。今回はサンプルなので、以下の部分のみ変更しました。

DOWNLOAD_DELAY=5
BOT_NAME = 'samplespider'

＊settings.pyの詳しい設定内容はScrapyの公式ドキュメント： http://doc.scrapy.org/en/latest/topics/settings.html

また、クロールで抽出するアイテムをitem.pyで定義します。以下はURLとタイトルを取得する例です。

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy

class CrawlerItem(scrapy.Item):
    # define the fields for your item here like:
    URL = scrapy.Field()
    title = scrapy.Field()
    pass

3-3. スクレイピング＆クロール用のSpider作成

scrapyコマンドでspiderのテンプレートを作成します。

$ scrapy genspider -t crawl sample_spider example.com

すると、spiders/以下にsample_spider.pyというファイルが作られていると思います。以下は簡単なサンプルです。

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.selector import Selector

class BaseSpider(CrawlSpider):
    name = 'sample_spider'
    # スクレイピングするドメインを指定します
    allowed_domains = ['example.com']

    # スクレイピングを開始するURL
    start_urls = ['http://example.com/']

    # ドメイン以下のURLに関して正規表現で指定します
    # /123など数字が連続しているURLはスクレイピングします
    allow_list = ['/\d+']
    # category, tag, pageなどが含まれるページはスクレイピングしません
    deny_list = ['category', 'tag', 'page']

    rules = (
        # スクレイピングするURLのルールを指定
        # parse_item関数がコールバック
        Rule(LinkExtractor( allow=allow_list, deny=deny_list ), callback='parse_item'),

        # spiderがたどるURLを指定
        # 今回はallowed_domainsの場合は全てたどります。
        Rule(LinkExtractor(), follow=True),
    )

    def parse_item(self, response):
        i = {}
        selector = Selector(response)
        i['URL'] = response.url
        i['title'] = selector.xpath('/html/head/title/text()').extract()
        return i

sitemap.xmlからスクレイピングをする場合は以下のようになります。


# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector
from scrapy.spiders import SitemapSpider

class SitemapSpider(SitemapSpider):
    name = 'sitemap'
    allowed_domains = ['www.green-japan.com']
    sitemap_urls = ['https://www.green-japan.com/sitemap/companies.xml']

    sitemap_rules = [
        (r'/company/', 'parse_item'),
    ]

    def parse_item(self, response):
        i = {}
        selector = Selector(response)

        i['url'] = response.url
        i['title'] = selector.xpath('/html/head/title/text()').extract()
        return ii

3-4. スクレイピングの実行

以下のコマンドでスクレイピングを実行します。実行結果は、test.csvに出力されます。

$ scrapy crawl sample_spider -o test.csv

クロールをはじめると、以下のような表示がされます。

▼クロール除外時
2015-12-11 16:21:46 [scrapy] DEBUG: Filtered offsite request to 'www.not-target.com': 

▼クロール時
2015-12-11 16:21:49 [scrapy] DEBUG: Crawled (200)  (referer: http://sample.com/)

また、ターミナルを閉じてでもバックグラウンドで実行したい場合は、nohupコマンドで使います。以下はバックグラウンドで実行し、title.csvに実行結果を吐き出す例です。

$ nohup scrapy crawl sample_spidersimple -o titles.csv &
  [1] 11564
  nohup: ignoring input and appending output to `nohup.out'

実行結果は以下のコマンドで確認できます。

$ tail -f nohup.out

【参考】 Scrapy 1.0が公開されました ScrapyでWebサイトのタイトルとURLを再帰的に取得する scrapy を用いてデータを収集し、mongoDB に投入する