scrapy – Linux & Android Dialy

scrapy

Scrapy is a framework for scraping and crawling.

To install it:

`pip3 install scrapy`

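Let's use it to create a project for Hatena Anonymous Diary (はてな匿名ダイアリー),
which is apparently nicknamed "Masuda" (増田).

The site to crawl is at
https://anond.hatelabo.jp/

`scrapy startproject anond`

As shown here,
`scrapy startproject <project name>`
creates a project.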

`cd anond/`

moves into the project directory.

`sudo apt install tree`

installs the tree command.

Running

`tree`

shows the layout:

.
├── anond
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg

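This is the skeleton that `scrapy startproject` generates.

Next, configure Scrapy for Hatena Anonymous Diary.

Adjust the crawl interval in settings.py.
By default the delay is 0 seconds with up to 16 concurrent requests,
which puts load on the server.

So set it to wait 1 second on average between downloads.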

`vim anond/settings.py`

Open the file and,
on line 28, change

`#DOWNLOAD_DELAY = 3`

to

`DOWNLOAD_DELAY = 1`

and save.
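
The wait is "on average" 1 second because Scrapy randomizes it by default: with `RANDOMIZE_DOWNLOAD_DELAY` left at its default of `True`, each wait falls between 0.5 and 1.5 times `DOWNLOAD_DELAY`. The snippet below is a rough simulation of that behavior in plain Python, not Scrapy's own code, and the helper name `sample_delays` is made up for illustration.

```python
import random

DOWNLOAD_DELAY = 1  # same value as in settings.py


def sample_delays(base_delay, count, seed=0):
    """Draw `count` waits between 0.5x and 1.5x of `base_delay` (hypothetical helper)."""
    rng = random.Random(seed)
    return [rng.uniform(0.5 * base_delay, 1.5 * base_delay) for _ in range(count)]


delays = sample_delays(DOWNLOAD_DELAY, 10_000)
average = sum(delays) / len(delays)
print(f"min={min(delays):.2f}s max={max(delays):.2f}s average={average:.2f}s")
```

Individual waits vary, but over many requests they average out close to the configured value.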

Next, edit items.py
so that only the URL is collected,
by adding `url = scrapy.Field()`.

`vim anond/items.py`

Open the file and change

class AnondItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

to

class AnondItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    url = scrapy.Field()

and save. The `pass` is no longer needed once the class has a field.

Next, create the spider.

The command takes
the spider name as its first argument,
here anond_spider,

and the domain as its second argument,
here anond.hatelabo.jp.

`scrapy genspider anond_spider anond.hatelabo.jp`

After running it,

`tree`

now shows:

.
├── anond
│   ├── __init__.py
│   ├── __pycache__
│   │   ├── __init__.cpython-36.pyc
│   │   └── settings.cpython-36.pyc
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       ├── __pycache__
│       │   └── __init__.cpython-36.pyc
│       └── anond_spider.py
└── scrapy.cfg

The new anond_spider.py now appears under spiders/.

Next, implement the URL extraction for Hatena Anonymous Diary.

`vim anond/spiders/anond_spider.py`

Open the file.

The URL on line 7 uses http,
so change it to https:

`start_urls = ['https://anond.hatelabo.jp/']`

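Next, add the processing to run when a page is parsed.
This time it follows the permalink URLs.

Starting at line 9, change

def parse(self, response):
    pass

to

def parse(self, response):
    # Follow every link inside <p class="sectionfooter">
    for url in response.css('p.sectionfooter a::attr("href")'):
        yield response.follow(url)

The trailing `pass` is dropped because the method now has a body.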

That completes the setup, so run it:

`scrapy crawl anond_spider`

This starts the crawl,
and the URLs it fetches show up in the log output.
