Parse RSS and Atom feeds with feedparser in Python

Posted: 2018-06-30 | Tags: Python

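feedparser, a third-party Python library, parses RSS and Atom feeds so that you can extract information such as a site's latest articles. It absorbs the differences between the format specifications, so a feed in any of the supported formats can be handled the same way.

This article covers the following topics. The sample code uses feedparser 5.2.1.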

Feed formats
Installing feedparser
Basic usage of feedparser
Information available from feedparser
Extracting lists of new article URLs and titles from a feed

Feed formats

Several feed formats exist for distributing summaries of website content.

Currently, three formats are mainly in use:

RSS 1.0
- RDF Site Summary (RSS) 1.0
RSS 2.0
- RSS 2.0 Specification (version 2.0.11)
Atom
- RFC 4287 - The Atom Syndication Format

feedparser supports all of these formats.

Installing feedparser

It can be installed with pip (pip3 in some environments).

$ pip install feedparser

Basic usage of feedparser

The feed of the 技術評論社 website (http://gihyo.jp) is used as the example. As of June 2018, it is distributed in all three formats: RSS 1.0, RSS 2.0, and Atom.

RSS/ATOMフィードについて｜gihyo.jp … 技術評論社

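Import the libraries. time is used to handle date and time information, and pprint is used to make the output easier to read.

Related article: How to use pprint in Python (pretty-printing lists and dictionaries)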

import feedparser
import pprint
import time

print(feedparser.__version__)
# 5.2.1

source: feedparser_example.py

Passing the URL of the target feed to feedparser.parse() returns a FeedParserDict object. Here it is printed in abbreviated form with pprint (depth=1).

d_atom = feedparser.parse('http://gihyo.jp/feed/atom')

print(type(d_atom))
# <class 'feedparser.FeedParserDict'>

pprint.pprint(d_atom, depth=1)
# {'bozo': 0,
#  'encoding': 'UTF-8',
#  'entries': [...],
#  'feed': {...},
#  'headers': {...},
#  'href': 'http://gihyo.jp/feed/atom',
#  'namespaces': {...},
#  'status': 200,
#  'updated': 'Sat, 30 Jun 2018 07:22:01 GMT',
#  'updated_parsed': time.struct_time(tm_year=2018, tm_mon=6, tm_mday=30, tm_hour=7, tm_min=22, tm_sec=1, tm_wday=5, tm_yday=181, tm_isdst=0),
#  'version': 'atom10'}

source: feedparser_example.py

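Like a dictionary (dict object), a FeedParserDict lets you get a value by specifying a key, and it supports methods such as get() and keys().

Related article: Getting a value by key with the get method of a Python dictionary (works even for missing keys)
Related article: for loops over Python dictionaries (dict): keys, values, items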

print(d_atom['encoding'])
# UTF-8

print(d_atom.get('encoding'))
# UTF-8

print(list(d_atom.keys()))
# ['feed', 'entries', 'bozo', 'headers', 'updated', 'updated_parsed', 'href', 'status', 'encoding', 'version', 'namespaces']

source: feedparser_example.py

A FeedParserDict object is obtained in the same way from the URLs of the other formats.

RSS 1.0:

d_rss1 = feedparser.parse('http://gihyo.jp/feed/rss1')

print(type(d_rss1))
# <class 'feedparser.FeedParserDict'>

pprint.pprint(d_rss1, depth=1)
# {'bozo': 0,
#  'encoding': 'UTF-8',
#  'entries': [...],
#  'feed': {...},
#  'headers': {...},
#  'href': 'http://gihyo.jp/feed/rss1',
#  'namespaces': {...},
#  'status': 200,
#  'updated': 'Sat, 30 Jun 2018 07:22:01 GMT',
#  'updated_parsed': time.struct_time(tm_year=2018, tm_mon=6, tm_mday=30, tm_hour=7, tm_min=22, tm_sec=1, tm_wday=5, tm_yday=181, tm_isdst=0),
#  'version': 'rss10'}

source: feedparser_example.py

RSS 2.0:

d_rss2 = feedparser.parse('http://gihyo.jp/feed/rss2')

print(type(d_rss2))
# <class 'feedparser.FeedParserDict'>

pprint.pprint(d_rss2, depth=1)
# {'bozo': 0,
#  'encoding': 'UTF-8',
#  'entries': [...],
#  'feed': {...},
#  'headers': {...},
#  'href': 'http://gihyo.jp/feed/rss2',
#  'namespaces': {},
#  'status': 200,
#  'updated': 'Sat, 30 Jun 2018 07:22:01 GMT',
#  'updated_parsed': time.struct_time(tm_year=2018, tm_mon=6, tm_mday=30, tm_hour=7, tm_min=22, tm_sec=1, tm_wday=5, tm_yday=181, tm_isdst=0),
#  'version': 'rss20'}

source: feedparser_example.py

In this way, every feed can be handled as the same FeedParserDict object, regardless of the feed specification.

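Information available from feedparser

For details of the information stored in a FeedParserDict, see the official reference below. It also documents which RSS 1.0 / RSS 2.0 / Atom element each FeedParserDict value is extracted from.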

Reference — feedparser 5.2.0 documentation

This section describes the two most frequently used keys, feed and entries.

feed

The feed key holds information about the feed itself. Its value is also a FeedParserDict.

feed = feedparser.parse('http://gihyo.jp/feed/atom')['feed']

print(type(feed))
# <class 'feedparser.FeedParserDict'>

pprint.pprint(feed)
# {'author': '技術評論社',
#  'author_detail': {'name': '技術評論社'},
#  'authors': [{'name': '技術評論社'}],
#  'guidislink': True,
#  'icon': 'http://gihyo.jp/assets/templates/gihyojp2007/image/header_logo_gihyo.gif',
#  'id': 'http://gihyo.jp/',
#  'link': 'http://gihyo.jp/',
#  'links': [{'href': 'http://gihyo.jp/',
#             'rel': 'alternate',
#             'type': 'text/html'}],
#  'rights': '技術評論社 2018',
#  'rights_detail': {'base': 'http://gihyo.jp/feed/atom',
#                    'language': None,
#                    'type': 'text/plain',
#                    'value': '技術評論社 2018'},
#  'subtitle': 'gihyo.jp（総合）の更新情報をお届けします',
#  'subtitle_detail': {'base': 'http://gihyo.jp/feed/atom',
#                      'language': None,
#                      'type': 'text/plain',
#                      'value': 'gihyo.jp（総合）の更新情報をお届けします'},
#  'title': 'gihyo.jp：総合',
#  'title_detail': {'base': 'http://gihyo.jp/feed/atom',
#                   'language': None,
#                   'type': 'text/plain',
#                   'value': 'gihyo.jp：総合'},
#  'updated': '2018-06-30T16:22:01+09:00',
#  'updated_parsed': time.struct_time(tm_year=2018, tm_mon=6, tm_mday=30, tm_hour=7, tm_min=22, tm_sec=1, tm_wday=5, tm_yday=181, tm_isdst=0)}

source: feedparser_example.py

'updated' is the feed's last-updated date and time as a string.

print(feed['updated'])
# 2018-06-30T16:22:01+09:00

print(type(feed['updated']))
# <class 'str'>

source: feedparser_example.py

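'updated_parsed' is the feed's last-updated date and time as a time.struct_time object.

A time.struct_time object exposes the year, month, day, and so on as attributes, and it can be converted to a string in any format with strftime() from the standard library's time module.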

t = feed['updated_parsed']

print(t)
# time.struct_time(tm_year=2018, tm_mon=6, tm_mday=30, tm_hour=7, tm_min=22, tm_sec=1, tm_wday=5, tm_yday=181, tm_isdst=0)

print(type(t))
# <class 'time.struct_time'>

print(t.tm_year)
# 2018

print(t.tm_mon)
# 6

print(t.tm_mday)
# 30

print(time.strftime('%Y-%m-%d %H:%M:%S', t))
# 2018-06-30 07:22:01

source: feedparser_example.py

In this example, 'updated' is in JST (Japan Standard Time, UTC+9), whereas 'updated_parsed' is in GMT (UTC), because feedparser normalizes parsed dates to UTC.
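If local time is needed, the UTC struct_time can be converted to a timezone-aware datetime. The following is a minimal sketch using only the standard library; the helper name struct_time_to_datetime and the JST constant are introduced here for illustration, and the struct_time is built by hand from the value shown above rather than fetched from the feed.

```python
import calendar
import time
from datetime import datetime, timedelta, timezone

# Fixed-offset timezone for Japan Standard Time (UTC+9).
JST = timezone(timedelta(hours=9), 'JST')


def struct_time_to_datetime(t, tz=timezone.utc):
    """Interpret t as UTC and return an aware datetime in tz."""
    # calendar.timegm() treats the struct_time as UTC (time.mktime() would
    # treat it as local time, which is not what we want here).
    timestamp = calendar.timegm(t)
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).astimezone(tz)


# Same value as feed['updated_parsed'] in the example above.
t = time.struct_time((2018, 6, 30, 7, 22, 1, 5, 181, 0))

dt_utc = struct_time_to_datetime(t)
dt_jst = struct_time_to_datetime(t, JST)

print(dt_utc.isoformat())
# 2018-06-30T07:22:01+00:00

print(dt_jst.isoformat())
# 2018-06-30T16:22:01+09:00
```

The JST result matches the 'updated' string of the feed, which confirms that the two values describe the same instant.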

entries

The value of the entries key is a list of FeedParserDict objects, each of which holds the details of one content item.

entries = feedparser.parse('http://gihyo.jp/feed/atom')['entries']

print(type(entries))
# <class 'list'>

print(len(entries))
# 20

entry = entries[0]

print(type(entry))
# <class 'feedparser.FeedParserDict'>

pprint.pprint(entry)
# {'author': '階戸アキラ',
#  'author_detail': {'name': '階戸アキラ'},
#  'authors': [{'name': '階戸アキラ'}],
#  'guidislink': False,
#  'id': 'http://gihyo.jp/admin/clip/01/linux_dt/201806/29',
#  'link': 'http://gihyo.jp/admin/clip/01/linux_dt/201806/29',
#  'links': [{'href': 'http://gihyo.jp/admin/clip/01/linux_dt/201806/29',
#             'rel': 'alternate',
#             'type': 'text/html'}],
#  'published': '2018-06-29T15:46:00+09:00',
#  'published_parsed': time.struct_time(tm_year=2018, tm_mon=6, tm_mday=29, tm_hour=6, tm_min=46, tm_sec=0, tm_wday=4, tm_yday=180, tm_isdst=0),
#  'summary': 'Gentoo '
#             'Linuxは6月28日（世界標準時），同日20時20分に正体不明の何者かによってGitHubのページのコントロールが奪われたことを明らかにした。',
#  'summary_detail': {'base': 'http://gihyo.jp/feed/atom',
#                     'language': None,
#                     'type': 'text/plain',
#                     'value': 'Gentoo '
#                              'Linuxは6月28日（世界標準時），同日20時20分に正体不明の何者かによってGitHubのページのコントロールが奪われたことを明らかにした。'},
#  'tags': [{'label': None,
#            'scheme': 'http://gihyo.jp/admin/clip/01/linux_dt',
#            'term': 'Linux Daily Topics'}],
#  'title': '2018年6月29日\u3000Gentoo，GitHubリポジトリを不正ハックされる ── Linux Daily Topics',
#  'title_detail': {'base': 'http://gihyo.jp/feed/atom',
#                   'language': None,
#                   'type': 'text/plain',
#                   'value': '2018年6月29日\u3000Gentoo，GitHubリポジトリを不正ハックされる ── '
#                            'Linux Daily Topics'},
#  'updated': '2018-06-29T15:46:00+09:00',
#  'updated_parsed': time.struct_time(tm_year=2018, tm_mon=6, tm_mday=29, tm_hour=6, tm_min=46, tm_sec=0, tm_wday=4, tm_yday=180, tm_isdst=0)}

source: feedparser_example.py

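Each entry includes information such as the URL (link), title (title), and summary (summary) of the content.

Note that not every site distributes a complete feed, so some fields such as the summary may be missing.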
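One way to cope with missing fields is to read them with get() and a default value instead of indexing by key. The following is a minimal sketch: entry_to_record() is a hypothetical helper, and plain dicts with made-up values stand in for FeedParserDict entries, which support get() in the same way.

```python
def entry_to_record(entry):
    """Collect a few fields from an entry, tolerating missing keys."""
    return {
        'url': entry.get('link', ''),
        'title': entry.get('title', ''),
        'summary': entry.get('summary', ''),
    }


# Plain dicts standing in for feed entries (hypothetical data).
sample_entries = [
    {'link': 'http://example.com/1', 'title': 'Full entry', 'summary': 'Has a summary'},
    {'link': 'http://example.com/2', 'title': 'No summary'},
]

records = [entry_to_record(e) for e in sample_entries]

print(records[1])
# {'url': 'http://example.com/2', 'title': 'No summary', 'summary': ''}
```

Indexing with entry['summary'] would raise KeyError for the second entry, whereas get() falls back to the empty string.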

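Extracting lists of new article URLs and titles from a feed

As a practical example of using feedparser, this section extracts lists of new article URLs and titles from a feed.

List of URLs

Use a list comprehension to take the link URL from each entry in entries.

Related article: How to use list comprehensions in Python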

d = feedparser.parse('http://gihyo.jp/feed/atom')

urls = [entry['link'] for entry in d['entries']]

pprint.pprint(urls)
# ['http://gihyo.jp/admin/clip/01/linux_dt/201806/29',
#  'http://gihyo.jp/admin/clip/01/ubuntu-topics/201806/29',
#  'http://gihyo.jp/book/pickup/2018/0044',
#  'http://gihyo.jp/book/pickup/2018/0043',
#  'http://gihyo.jp/news/info/2018/06/2801',
#  'http://gihyo.jp/news/nr/2018/06/2801',
#  'http://gihyo.jp/dev/serial/01/continue-power/0012',
#  'http://gihyo.jp/lifestyle/clip/01/awt/201806/28',
#  'http://gihyo.jp/design/clip/01/weekly-web-tech/201806/28',
#  'http://gihyo.jp/book/pickup/2018/0042',
#  'http://gihyo.jp/book/pickup/2018/0041',
#  'http://gihyo.jp/admin/serial/01/ubuntu-recipe/0525',
#  'http://gihyo.jp/dev/serial/01/funny-play-pb/0007',
#  'http://gihyo.jp/book/pickup/2018/0040',
#  'http://gihyo.jp/book/pickup/2018/0039',
#  'http://gihyo.jp/news/info/2018/06/36908',
#  'http://gihyo.jp/news/info/2018/06/36903',
#  'http://gihyo.jp/admin/clip/01/linux_dt/201806/26',
#  'http://gihyo.jp/lifestyle/serial/01/ganshiki-soushi/0099',
#  'http://gihyo.jp/dev/serial/01/mysql-road-construction-news/0074']

source: feedparser_example.py

List of titles

Titles are extracted in the same way.

titles = [entry['title'] for entry in d['entries']]

pprint.pprint(titles)
# ['2018年6月29日\u3000Gentoo，GitHubリポジトリを不正ハックされる ── Linux Daily Topics',
#  '2018年6月29日号\u3000CanonicalのUbuntu Desktop調査，Spectre/Meltdown対策さらにさらにその後・AMD編 '
#  '── Ubuntu Weekly Topics',
#  'Alexaスキル開発の勘所―進化し続けるAlexaの“今”を知る！ ── 新刊ピックアップ',
#  'IT技術変革の軌跡～変わることと変わらないこと～ ── 新刊ピックアップ',
#  '「Python Boot Camp」7/21に茨城県つくば市で開催 ── インフォメーション',
#  'ヌーラボ，オンライン描画ツール「Cacoo」のUIを全面刷新――全世界300万人のユーザから得たUXリサーチ結果を反映 ── ニュースリリース',
#  '最終回\u3000エンジニアはどこに行くのか ── 継続は力なり―大器晩成エンジニアを目指して',
#  '2018年6月第5週\u3000Googleがポッドキャストへ再参入 ── Android Weekly Topics',
#  '2018年6月第4週号 '
#  '1位は，デザイン作業の段階に分けておすすめのUXツールを紹介，気になるネタは，Instagram、YouTubeに対抗する長尺動画サービス「IGTV」提供開始 '
#  '── 週刊Webテク通信',
#  'デジ絵をはじめるなら「クリスタ」で決まり！ ── 新刊ピックアップ',
#  'ほぼほぼ理解！ ブロックチェーンの何が「スゴイ」のか？ ── 新刊ピックアップ',
#  '第525回\u3000Ubuntu 18.04 LTSリリース記念オフラインミーティング フォトレポート ── Ubuntu Weekly Recipe',
#  'File.#007\u3000社内プチミステリ（連載第69回） ── きたみりゅうじの聞かせて珍プレー プレイバック',
#  '正しいコードの書き方とは？～ウェブ業界の即戦力となるHTMLとCSSの記述方法を身につけよう！ ── 新刊ピックアップ',
#  '小さな会社やお店の販促ツールが無料で作れる！ Canvaを始めよう！ ── 新刊ピックアップ',
#  '書籍『統計思考の世界』『系統体系学の世界』刊行記念トークイベント， 7月20日にゲンロンカフェで開催 ── インフォメーション',
#  'Ruby bizグランプリ2018募集開始\u3000応募は9月14日まで ── インフォメーション',
#  '2018年6月26日\u3000Kubernetesこそ未来 ―GitLab，プラットフォームをAzureからGCPへ移行 ── Linux Daily '
#  'Topics',
#  '第99回\u3000Plamo-7.0とSysvinit ── 玩式草子─ソフトウェアとたわむれる日々',
#  '第74回\u3000さまざまなMySQLのバージョンを試す ── MySQL道普請便り']

source: feedparser_example.py

List of dictionaries of URLs and titles

You can also build a list of dictionaries, each containing a URL and a title.

dicts = [{'url': e['link'], 'title': e['title']} for e in d['entries']]

pprint.pprint(dicts)
# [{'title': '2018年6月29日\u3000Gentoo，GitHubリポジトリを不正ハックされる ── Linux Daily Topics',
#   'url': 'http://gihyo.jp/admin/clip/01/linux_dt/201806/29'},
#  {'title': '2018年6月29日号\u3000CanonicalのUbuntu '
#            'Desktop調査，Spectre/Meltdown対策さらにさらにその後・AMD編 ── Ubuntu Weekly Topics',
#   'url': 'http://gihyo.jp/admin/clip/01/ubuntu-topics/201806/29'},
#  {'title': 'Alexaスキル開発の勘所―進化し続けるAlexaの“今”を知る！ ── 新刊ピックアップ',
#   'url': 'http://gihyo.jp/book/pickup/2018/0044'},
#  {'title': 'IT技術変革の軌跡～変わることと変わらないこと～ ── 新刊ピックアップ',
#   'url': 'http://gihyo.jp/book/pickup/2018/0043'},
#  {'title': '「Python Boot Camp」7/21に茨城県つくば市で開催 ── インフォメーション',
#   'url': 'http://gihyo.jp/news/info/2018/06/2801'},
#  {'title': 'ヌーラボ，オンライン描画ツール「Cacoo」のUIを全面刷新――全世界300万人のユーザから得たUXリサーチ結果を反映 ── '
#            'ニュースリリース',
#   'url': 'http://gihyo.jp/news/nr/2018/06/2801'},
#  {'title': '最終回\u3000エンジニアはどこに行くのか ── 継続は力なり―大器晩成エンジニアを目指して',
#   'url': 'http://gihyo.jp/dev/serial/01/continue-power/0012'},
#  {'title': '2018年6月第5週\u3000Googleがポッドキャストへ再参入 ── Android Weekly Topics',
#   'url': 'http://gihyo.jp/lifestyle/clip/01/awt/201806/28'},
#  {'title': '2018年6月第4週号 '
#            '1位は，デザイン作業の段階に分けておすすめのUXツールを紹介，気になるネタは，Instagram、YouTubeに対抗する長尺動画サービス「IGTV」提供開始 '
#            '── 週刊Webテク通信',
#   'url': 'http://gihyo.jp/design/clip/01/weekly-web-tech/201806/28'},
#  {'title': 'デジ絵をはじめるなら「クリスタ」で決まり！ ── 新刊ピックアップ',
#   'url': 'http://gihyo.jp/book/pickup/2018/0042'},
#  {'title': 'ほぼほぼ理解！ ブロックチェーンの何が「スゴイ」のか？ ── 新刊ピックアップ',
#   'url': 'http://gihyo.jp/book/pickup/2018/0041'},
#  {'title': '第525回\u3000Ubuntu 18.04 LTSリリース記念オフラインミーティング フォトレポート ── Ubuntu '
#            'Weekly Recipe',
#   'url': 'http://gihyo.jp/admin/serial/01/ubuntu-recipe/0525'},
#  {'title': 'File.#007\u3000社内プチミステリ（連載第69回） ── きたみりゅうじの聞かせて珍プレー プレイバック',
#   'url': 'http://gihyo.jp/dev/serial/01/funny-play-pb/0007'},
#  {'title': '正しいコードの書き方とは？～ウェブ業界の即戦力となるHTMLとCSSの記述方法を身につけよう！ ── 新刊ピックアップ',
#   'url': 'http://gihyo.jp/book/pickup/2018/0040'},
#  {'title': '小さな会社やお店の販促ツールが無料で作れる！ Canvaを始めよう！ ── 新刊ピックアップ',
#   'url': 'http://gihyo.jp/book/pickup/2018/0039'},
#  {'title': '書籍『統計思考の世界』『系統体系学の世界』刊行記念トークイベント， 7月20日にゲンロンカフェで開催 ── インフォメーション',
#   'url': 'http://gihyo.jp/news/info/2018/06/36908'},
#  {'title': 'Ruby bizグランプリ2018募集開始\u3000応募は9月14日まで ── インフォメーション',
#   'url': 'http://gihyo.jp/news/info/2018/06/36903'},
#  {'title': '2018年6月26日\u3000Kubernetesこそ未来 ―GitLab，プラットフォームをAzureからGCPへ移行 ── '
#            'Linux Daily Topics',
#   'url': 'http://gihyo.jp/admin/clip/01/linux_dt/201806/26'},
#  {'title': '第99回\u3000Plamo-7.0とSysvinit ── 玩式草子─ソフトウェアとたわむれる日々',
#   'url': 'http://gihyo.jp/lifestyle/serial/01/ganshiki-soushi/0099'},
#  {'title': '第74回\u3000さまざまなMySQLのバージョンを試す ── MySQL道普請便り',
#   'url': 'http://gihyo.jp/dev/serial/01/mysql-road-construction-news/0074'}]

print(dicts[0]['url'])
# http://gihyo.jp/admin/clip/01/linux_dt/201806/29

print(dicts[0]['title'])
# 2018年6月29日　Gentoo，GitHubリポジトリを不正ハックされる ── Linux Daily Topics

source: feedparser_example.py

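A note on full-width spaces

The output in the examples so far contains the string \u3000 in several places.

\u3000 is the full-width (ideographic) space. pprint shows it in escaped form because it prints the repr of each string, while print() outputs it as an actual full-width space.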

print('\u3000' == '　')
# True

title = d['entries'][0]['title']

print(repr(title))
# '2018年6月29日\u3000Gentoo，GitHubリポジトリを不正ハックされる ── Linux Daily Topics'

print(title)
# 2018年6月29日　Gentoo，GitHubリポジトリを不正ハックされる ── Linux Daily Topics

source: feedparser_example.py

It can be replaced with a half-width (ASCII) space using replace().

print(title.replace('\u3000', ' '))
# 2018年6月29日 Gentoo，GitHubリポジトリを不正ハックされる ── Linux Daily Topics

source: feedparser_example.py

The replacement can also be done inside a list comprehension.

titles_space = [entry['title'].replace('\u3000', ' ') for entry in d['entries']]

pprint.pprint(titles_space)
# ['2018年6月29日 Gentoo，GitHubリポジトリを不正ハックされる ── Linux Daily Topics',
#  '2018年6月29日号 CanonicalのUbuntu Desktop調査，Spectre/Meltdown対策さらにさらにその後・AMD編 ── '
#  'Ubuntu Weekly Topics',
#  'Alexaスキル開発の勘所―進化し続けるAlexaの“今”を知る！ ── 新刊ピックアップ',
#  'IT技術変革の軌跡～変わることと変わらないこと～ ── 新刊ピックアップ',
#  '「Python Boot Camp」7/21に茨城県つくば市で開催 ── インフォメーション',
#  'ヌーラボ，オンライン描画ツール「Cacoo」のUIを全面刷新――全世界300万人のユーザから得たUXリサーチ結果を反映 ── ニュースリリース',
#  '最終回 エンジニアはどこに行くのか ── 継続は力なり―大器晩成エンジニアを目指して',
#  '2018年6月第5週 Googleがポッドキャストへ再参入 ── Android Weekly Topics',
#  '2018年6月第4週号 '
#  '1位は，デザイン作業の段階に分けておすすめのUXツールを紹介，気になるネタは，Instagram、YouTubeに対抗する長尺動画サービス「IGTV」提供開始 '
#  '── 週刊Webテク通信',
#  'デジ絵をはじめるなら「クリスタ」で決まり！ ── 新刊ピックアップ',
#  'ほぼほぼ理解！ ブロックチェーンの何が「スゴイ」のか？ ── 新刊ピックアップ',
#  '第525回 Ubuntu 18.04 LTSリリース記念オフラインミーティング フォトレポート ── Ubuntu Weekly Recipe',
#  'File.#007 社内プチミステリ（連載第69回） ── きたみりゅうじの聞かせて珍プレー プレイバック',
#  '正しいコードの書き方とは？～ウェブ業界の即戦力となるHTMLとCSSの記述方法を身につけよう！ ── 新刊ピックアップ',
#  '小さな会社やお店の販促ツールが無料で作れる！ Canvaを始めよう！ ── 新刊ピックアップ',
#  '書籍『統計思考の世界』『系統体系学の世界』刊行記念トークイベント， 7月20日にゲンロンカフェで開催 ── インフォメーション',
#  'Ruby bizグランプリ2018募集開始 応募は9月14日まで ── インフォメーション',
#  '2018年6月26日 Kubernetesこそ未来 ―GitLab，プラットフォームをAzureからGCPへ移行 ── Linux Daily '
#  'Topics',
#  '第99回 Plamo-7.0とSysvinit ── 玩式草子─ソフトウェアとたわむれる日々',
#  '第74回 さまざまなMySQLのバージョンを試す ── MySQL道普請便り']

source: feedparser_example.py
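As an alternative to replace(), all whitespace can be normalized in one step by splitting and rejoining the string. This is a minimal sketch, not part of the original sample file; normalize_spaces() is a hypothetical helper. Note that it behaves differently from replace(): it also collapses runs of whitespace, converts tabs and newlines, and strips leading and trailing whitespace.

```python
def normalize_spaces(s):
    """Collapse every run of Unicode whitespace into one ASCII space."""
    # str.split() with no argument splits on any Unicode whitespace,
    # which includes the full-width space U+3000.
    return ' '.join(s.split())


title = '2018年6月29日\u3000Gentoo，GitHubリポジトリを不正ハックされる ── Linux Daily Topics'

print(normalize_spaces(title))
# 2018年6月29日 Gentoo，GitHubリポジトリを不正ハックされる ── Linux Daily Topics

print(normalize_spaces('a\u3000\u3000b \n c'))
# a b c
```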
