Fetching paper metadata and downloading PDFs with the arXiv API in Python

Modified: 2019-11-17 | Tags: Python, Web API

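arXiv is a preprint server, a website that hosts papers in physics, mathematics, computer science, and other fields.

This article explains how to use the arXiv API from Python to retrieve paper information (metadata) and download paper PDFs, and how to get the latest listings via RSS.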

arXiv API Python wrapper: arxiv
Retrieving paper metadata by search criteria
- Basic usage of arxiv.query()
- Arguments of arxiv.query()
- Examples
Downloading paper PDFs (individually or in bulk)
Fetching the arXiv RSS feed

arXiv API Python wrapper: arxiv

arXiv provides a public API for accessing paper data.

This article uses arxiv, a Python library that wraps the arXiv API.

lukasschwab/arxiv.py: Python wrapper for the arXiv API

arxiv can be installed with pip (pip3 in some environments).

$ pip install arxiv

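The information below applies to arxiv version 0.5.1. arxiv.query() and arxiv.download() belong to this 0.x interface; releases from 1.0 onward replaced them, so the examples here need the older version.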

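Retrieving paper metadata by search criteria

Basic usage of arxiv.query()

arxiv.query() retrieves paper information (metadata) matching the search criteria you specify. It returns a list whose elements are FeedParserDict objects, one per paper.

The following example retrieves papers whose author (au) is Grisha Perelman. The arguments are described in detail below.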

import pprint

import arxiv
import pandas as pd

l = arxiv.query(query='au:"Grisha Perelman"')

print(type(l))
# <class 'list'>

print(len(l))
# 3

print(type(l[0]))
# <class 'feedparser.FeedParserDict'>

source: arxiv_api.py

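FeedParserDict is a type defined in the third-party library feedparser. Below, pprint is used to print an entry in a readable form.

Related: Parsing RSS and Atom feeds with feedparser in Python
Related: How to use pprint in Python (pretty-printing lists and dictionaries)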

pprint.pprint(l[0], width=200)
# {'affiliation': 'None',
#  'arxiv_comment': '39 pages',
#  'arxiv_primary_category': {'scheme': 'http://arxiv.org/schemas/atom', 'term': 'math.DG'},
#  'arxiv_url': 'http://arxiv.org/abs/math/0211159v1',
#  'author': 'Grisha Perelman',
#  'author_detail': {'name': 'Grisha Perelman'},
#  'authors': ['Grisha Perelman'],
#  'doi': None,
#  'guidislink': True,
#  'id': 'http://arxiv.org/abs/math/0211159v1',
#  'journal_reference': None,
#  'links': [{'href': 'http://arxiv.org/abs/math/0211159v1', 'rel': 'alternate', 'type': 'text/html'},
#            {'href': 'http://arxiv.org/pdf/math/0211159v1', 'rel': 'related', 'title': 'pdf', 'type': 'application/pdf'}],
#  'pdf_url': 'http://arxiv.org/pdf/math/0211159v1',
#  'published': '2002-11-11T16:11:49Z',
#  'published_parsed': time.struct_time(tm_year=2002, tm_mon=11, tm_mday=11, tm_hour=16, tm_min=11, tm_sec=49, tm_wday=0, tm_yday=315, tm_isdst=0),
#  'summary': 'We present a monotonic expression for the Ricci flow, valid in all dimensions\n'
#             'and without curvature assumptions. It is interpreted as an entropy for a\n'
#             'certain canonical ensemble. Several geometric applications are given. In\n'
#             'particular, (1) Ricci flow, considered on the space of riemannian metrics\n'
#             'modulo diffeomorphism and scaling, has no nontrivial periodic orbits (that is,\n'
#             'other than fixed points); (2) In a region, where singularity is forming in\n'
#             'finite time, the injectivity radius is controlled by the curvature; (3) Ricci\n'
#             'flow can not quickly turn an almost euclidean region into a very curved one, no\n'
#             'matter what happens far away. We also verify several assertions related to\n'
#             "Richard Hamilton's program for the proof of Thurston geometrization conjecture\n"
#             'for closed three-manifolds, and give a sketch of an eclectic proof of this\n'
#             'conjecture, making use of earlier results on collapsing with local lower\n'
#             'curvature bound.',
#  'summary_detail': {'base': 'http://export.arxiv.org/api/query?search_query=au%3A%22Grisha+Perelman%22&id_list=&start=0&max_results=1000&sortBy=relevance&sortOrder=descending',
#                     'language': None,
#                     'type': 'text/plain',
#                     'value': 'We present a monotonic expression for the Ricci flow, valid in all dimensions\n'
#                              'and without curvature assumptions. It is interpreted as an entropy for a\n'
#                              'certain canonical ensemble. Several geometric applications are given. In\n'
#                              'particular, (1) Ricci flow, considered on the space of riemannian metrics\n'
#                              'modulo diffeomorphism and scaling, has no nontrivial periodic orbits (that is,\n'
#                              'other than fixed points); (2) In a region, where singularity is forming in\n'
#                              'finite time, the injectivity radius is controlled by the curvature; (3) Ricci\n'
#                              'flow can not quickly turn an almost euclidean region into a very curved one, no\n'
#                              'matter what happens far away. We also verify several assertions related to\n'
#                              "Richard Hamilton's program for the proof of Thurston geometrization conjecture\n"
#                              'for closed three-manifolds, and give a sketch of an eclectic proof of this\n'
#                              'conjecture, making use of earlier results on collapsing with local lower\n'
#                              'curvature bound.'},
#  'tags': [{'label': None, 'scheme': 'http://arxiv.org/schemas/atom', 'term': 'math.DG'}, {'label': None, 'scheme': 'http://arxiv.org/schemas/atom', 'term': '53C'}],
#  'title': 'The entropy formula for the Ricci flow and its geometric applications',
#  'title_detail': {'base': 'http://export.arxiv.org/api/query?search_query=au%3A%22Grisha+Perelman%22&id_list=&start=0&max_results=1000&sortBy=relevance&sortOrder=descending',
#                   'language': None,
#                   'type': 'text/plain',
#                   'value': 'The entropy formula for the Ricci flow and its geometric applications'},
#  'updated': '2002-11-11T16:11:49Z',
#  'updated_parsed': time.struct_time(tm_year=2002, tm_mon=11, tm_mday=11, tm_hour=16, tm_min=11, tm_sec=49, tm_wday=0, tm_yday=315, tm_isdst=0)}

source: arxiv_api.py

Values can be retrieved by key, just as with a dictionary.

print(l[0]['author'])
# Grisha Perelman

print(l[0]['title'])
# The entropy formula for the Ricci flow and its geometric applications

print(l[0]['arxiv_url'])
# http://arxiv.org/abs/math/0211159v1

print(l[0]['pdf_url'])
# http://arxiv.org/pdf/math/0211159v1

print(l[0]['summary'])
# We present a monotonic expression for the Ricci flow, valid in all dimensions
# and without curvature assumptions. It is interpreted as an entropy for a
# certain canonical ensemble. Several geometric applications are given. In
# particular, (1) Ricci flow, considered on the space of riemannian metrics
# modulo diffeomorphism and scaling, has no nontrivial periodic orbits (that is,
# other than fixed points); (2) In a region, where singularity is forming in
# finite time, the injectivity radius is controlled by the curvature; (3) Ricci
# flow can not quickly turn an almost euclidean region into a very curved one, no
# matter what happens far away. We also verify several assertions related to
# Richard Hamilton's program for the proof of Thurston geometrization conjecture
# for closed three-manifolds, and give a sketch of an eclectic proof of this
# conjecture, making use of earlier results on collapsing with local lower
# curvature bound.

source: arxiv_api.py

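From the list of FeedParserDict objects, you can also use a list comprehension to extract the values of a specific key into a list.

Related: How to use list comprehensions in Python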

pprint.pprint([a['id'] for a in l])
# ['http://arxiv.org/abs/math/0211159v1',
#  'http://arxiv.org/abs/math/0303109v1',
#  'http://arxiv.org/abs/math/0307245v1']

pprint.pprint([[a['id'], a['published']] for a in l])
# [['http://arxiv.org/abs/math/0211159v1', '2002-11-11T16:11:49Z'],
#  ['http://arxiv.org/abs/math/0303109v1', '2003-03-10T16:44:35Z'],
#  ['http://arxiv.org/abs/math/0307245v1', '2003-07-17T15:26:38Z']]

source: arxiv_api.py
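The published values shown above are plain strings. If you want to sort or compare them as dates, they can be parsed with the standard library. The following is a minimal sketch; parse_published is a hypothetical helper name, and it assumes the trailing Z denotes UTC as in ISO 8601.

```python
from datetime import datetime, timezone


def parse_published(value):
    """Parse a timestamp such as '2002-11-11T16:11:49Z' into an aware datetime."""
    # Assumption: the trailing 'Z' means UTC, so the result is tagged as UTC.
    parsed = datetime.strptime(value, '%Y-%m-%dT%H:%M:%SZ')
    return parsed.replace(tzinfo=timezone.utc)


# The three 'published' strings from the output above.
published = ['2002-11-11T16:11:49Z', '2003-03-10T16:44:35Z', '2003-07-17T15:26:38Z']
dates = [parse_published(p) for p in published]

print(dates[0].isoformat())
# 2002-11-11T16:11:49+00:00

# Whole days between the first and second submissions.
print((dates[1] - dates[0]).days)
# 119
```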

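Because a FeedParserDict can be treated as a dictionary, the list can be converted to a pandas.DataFrame with pd.io.json.json_normalize(). (In pandas 1.0 and later the same function is available as pd.json_normalize(), and the pd.io.json path is deprecated.)

Related: Converting a list of dictionaries to a DataFrame with pandas json_normalize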

df = pd.io.json.json_normalize(l)
print(df.shape)
# (3, 29)

print(df[['title', 'published']])
#                                                title             published
# 0  The entropy formula for the Ricci flow and its...  2002-11-11T16:11:49Z
# 1         Ricci flow with surgery on three-manifolds  2003-03-10T16:44:35Z
# 2  Finite extinction time for the solutions to th...  2003-07-17T15:26:38Z

source: arxiv_api.py

Arguments of arxiv.query()

arxiv.query() takes the following arguments.

lukasschwab/arxiv.py: Python wrapper for the arXiv API

arxiv.query(query="",
            id_list=[],
            max_results=None,
            start = 0,
            sort_by="relevance",
            sort_order="descending",
            prune=True,
            iterative=False,
            max_chunk_results=1000)

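The main ones are described below.

Search query: query

Pass the string to send to the arXiv API as the search query in the query argument. As in query='au:"Grisha Perelman"' above, a condition has the form <field prefix>:<search string>. Multiple conditions can be combined with AND, OR, and ANDNOT.

The official documentation is here:

arXiv API User's Manual - Details of Query Construction | arXiv e-print repository

The main field prefixes are as follows.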

ti: Title
au: Author
abs: Abstract
cat: Subject Category
all: All
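Query strings built from these prefixes are ordinary Python strings, so they can be assembled with small helpers instead of being typed by hand. The following is only a sketch: field and all_of are hypothetical helper names, not part of the arxiv library, and quoting values that contain a space follows the au:"Grisha Perelman" example above.

```python
def field(prefix, value):
    """Return one condition such as ti:"deep learning"."""
    # Quote values containing a space so they are searched as a phrase.
    if ' ' in value:
        value = '"{}"'.format(value)
    return '{}:{}'.format(prefix, value)


def all_of(*conditions):
    """Combine conditions so that every one of them must match."""
    return ' AND '.join(conditions)


query = all_of(field('cat', 'cs.AI'), field('ti', 'deep learning'))
print(query)
# cat:cs.AI AND ti:"deep learning"
```

The resulting string would be passed as the query argument of arxiv.query().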

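The list of subject categories, such as cs.AI and cs.CV, is here:

arXiv API User's Manual - Subject Classifications | arXiv e-print repository

For some reason this is not mentioned in the official documentation, but a range of submission dates can be specified in the form submittedDate:[YYYYMMDDHHMMSS TO YYYYMMDDHHMMSS].

search by date? - Google Groups

If HH, MM, and SS are omitted they are treated as 00. The start and end timestamps themselves appear to be included in the range.

Examples are given below.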

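Number of results: max_results, max_chunk_results

The number of results to retrieve is set with the max_results argument. The default, max_results=None, means no limit.

If you want every search result, you can leave the default as is, but note that retrieval takes time when the result set is large.

Internally, the library calls the arXiv API repeatedly at 3-second intervals, fetching the number of results given by max_chunk_results (default max_chunk_results=1000) each time, until all results have been retrieved.

The source code is here:

arxiv.py/arxiv.py at 0.5.1 · lukasschwab/arxiv.py

Requesting a very large number of results from the arXiv API in a single call is discouraged, so it is best not to raise max_chunk_results above the default of 1000.

arXiv API User's Manual - start and max_results paging | arXiv e-print repository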
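To make the chunked retrieval described above concrete, the sketch below computes the start and size of each request. It is an illustration of the paging idea only, not the library's actual code; chunk_windows is a hypothetical name, and the real loop also waits between requests and stops when the API returns no more results.

```python
import itertools


def chunk_windows(max_results=None, max_chunk_results=1000):
    """Yield (start, size) pairs, one per API request."""
    for start in itertools.count(0, max_chunk_results):
        if max_results is None:
            # No limit: the caller must stop when a request comes back empty.
            yield start, max_chunk_results
            continue
        remaining = max_results - start
        if remaining <= 0:
            return
        yield start, min(max_chunk_results, remaining)


print(list(chunk_windows(2500, 1000)))
# [(0, 1000), (1000, 1000), (2000, 500)]
```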

Sort order: sort_by, sort_order

The default is sort_by='relevance', which sorts by relevance. sort_by also accepts 'lastUpdatedDate' (last updated) and 'submittedDate' (submission date).

In addition, sort_order selects 'descending' (the default) or 'ascending'.

List of IDs: id_list

Paper IDs can be specified as a list.

For the structure of arXiv paper IDs, see the official documentation below. The scheme changed around April 2007.

Understanding the arXiv identifier | arXiv e-print repository

Examples

The following retrieves the 10 newest and the 10 oldest papers in cs.AI (the artificial intelligence category of computer science).

l = arxiv.query(query='cat:cs.AI', max_results=10, sort_by='submittedDate')

pprint.pprint([[a['id'], a['published']] for a in l])
# [['http://arxiv.org/abs/1907.11184v1', '2019-07-25T16:45:06Z'],
#  ['http://arxiv.org/abs/1907.11112v1', '2019-07-25T14:45:04Z'],
#  ['http://arxiv.org/abs/1907.11021v1', '2019-07-25T13:15:12Z'],
#  ['http://arxiv.org/abs/1907.11007v1', '2019-07-25T12:30:08Z'],
#  ['http://arxiv.org/abs/1907.10953v1', '2019-07-25T10:36:01Z'],
#  ['http://arxiv.org/abs/1907.10952v1', '2019-07-25T10:31:34Z'],
#  ['http://arxiv.org/abs/1907.10925v1', '2019-07-25T09:34:13Z'],
#  ['http://arxiv.org/abs/1907.10914v1', '2019-07-25T09:19:30Z'],
#  ['http://arxiv.org/abs/1907.10772v1', '2019-07-24T23:28:37Z'],
#  ['http://arxiv.org/abs/1907.10761v1', '2019-07-24T22:30:04Z']]

l = arxiv.query(query='cat:cs.AI', max_results=10,
                sort_by='submittedDate', sort_order='ascending')

pprint.pprint([[a['id'], a['published']] for a in l])
# [['http://arxiv.org/abs/cs/9308101v1', '1993-08-01T00:00:00Z'],
#  ['http://arxiv.org/abs/cs/9308102v1', '1993-08-01T00:00:00Z'],
#  ['http://arxiv.org/abs/cs/9309101v1', '1993-09-01T00:00:00Z'],
#  ['http://arxiv.org/abs/cs/9311101v1', '1993-11-01T00:00:00Z'],
#  ['http://arxiv.org/abs/cs/9311102v1', '1993-11-01T00:00:00Z'],
#  ['http://arxiv.org/abs/cs/9312101v1', '1993-12-01T00:00:00Z'],
#  ['http://arxiv.org/abs/cs/9401101v1', '1994-01-01T00:00:00Z'],
#  ['http://arxiv.org/abs/cs/9402101v1', '1994-02-01T00:00:00Z'],
#  ['http://arxiv.org/abs/cs/9402102v1', '1994-02-01T00:00:00Z'],
#  ['http://arxiv.org/abs/cs/9402103v1', '1994-02-01T00:00:00Z']]

source: arxiv_api.py

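The latest listings are also available via RSS, but the arXiv API provides more information. RSS is covered later.

The next example uses submittedDate to specify a submission date range and retrieves every cs.AI paper submitted in January 2019.

Writing 20190101 TO 20190131 is interpreted as midnight (00:00:00) on January 1 through midnight on January 31, so papers submitted during January 31 are excluded. Writing TO 20190201 instead would also include a paper submitted at exactly 00:00:00 on February 1. Take care if you need exact boundaries.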

l = arxiv.query(query='cat:cs.AI AND submittedDate:[20190101 TO 20190131235959]',
                sort_by='submittedDate', sort_order='ascending')

df = pd.io.json.json_normalize(l)
print(df.shape)
# (298, 29)

print(df.head()[['id', 'published']])
#                                   id             published
# 0  http://arxiv.org/abs/1901.00073v1  2019-01-01T01:22:19Z
# 1  http://arxiv.org/abs/1901.00117v1  2019-01-01T08:50:47Z
# 2  http://arxiv.org/abs/1901.00158v2  2019-01-01T14:41:17Z
# 3  http://arxiv.org/abs/1901.01851v1  2019-01-01T18:05:43Z
# 4  http://arxiv.org/abs/1901.00204v1  2019-01-01T20:02:38Z

print(df.tail()[['id', 'published']])
#                                     id             published
# 293  http://arxiv.org/abs/1902.00045v1  2019-01-31T19:33:13Z
# 294  http://arxiv.org/abs/1902.00098v1  2019-01-31T22:14:34Z
# 295  http://arxiv.org/abs/1902.03092v1  2019-01-31T22:26:56Z
# 296  http://arxiv.org/abs/1902.00120v1  2019-01-31T23:10:31Z
# 297  http://arxiv.org/abs/1902.00137v2  2019-01-31T23:59:34Z

source: arxiv_api.py
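The end-of-month boundary discussed above is easy to get wrong by hand, so one option is to generate the range from the calendar. This is a sketch under the assumptions stated in this article (the YYYYMMDDHHMMSS format, with both ends included); month_range is a hypothetical helper name.

```python
import calendar


def month_range(year, month):
    """Return a submittedDate condition covering one whole calendar month."""
    # monthrange() returns (weekday of day 1, number of days in the month).
    last_day = calendar.monthrange(year, month)[1]
    start = '{:04d}{:02d}01000000'.format(year, month)
    end = '{:04d}{:02d}{:02d}235959'.format(year, month, last_day)
    return 'submittedDate:[{} TO {}]'.format(start, end)


print(month_range(2019, 1))
# submittedDate:[20190101000000 TO 20190131235959]

print(month_range(2020, 2))
# submittedDate:[20200201000000 TO 20200229235959]
```

The returned string can be joined to other conditions with AND, for example 'cat:cs.AI AND ' + month_range(2019, 1).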

In addition to the conditions above, the next example extracts only the papers whose title contains the phrase deep learning.

l = arxiv.query(query='cat:cs.AI AND ti:"deep learning" AND submittedDate:[20190101 TO 20190131235959]',
                sort_by='submittedDate', sort_order='ascending')

df = pd.io.json.json_normalize(l)
print(df[['title', 'published']])
#                                                title             published
# 0  Augmentation Scheme for Dealing with Imbalance...  2019-01-01T20:02:38Z
# 1  Geometrization of deep networks for the interp...  2019-01-06T14:32:45Z
# 2  Deep Learning for Human Affect Recognition: In...  2019-01-09T23:33:47Z
# 3  Automatic Surface Area and Volume Prediction o...  2019-01-15T17:26:43Z
# 4  Fast Deep Learning for Automatic Modulation Cl...  2019-01-16T01:15:50Z
# 5  DLocRL: A Deep Learning Pipeline for Fine-Grai...  2019-01-21T17:36:19Z
# 6  DF-SLAM: A Deep-Learning Enhanced Visual SLAM ...  2019-01-22T09:25:08Z
# 7            Deep learning Inversion of Seismic Data  2019-01-23T05:51:05Z
# 8  Proceedings of AAAI 2019 Workshop on Network I...  2019-01-25T10:12:23Z

source: arxiv_api.py

The next example specifies paper IDs with the id_list argument. The ID is the string at the end of each paper's URL.

l = arxiv.query(id_list=['1902.00358v2', '1902.00358', 'math/0211159v1'])

for a in l:
    print(a['arxiv_url'])
# http://arxiv.org/abs/1902.00358v2
# http://arxiv.org/abs/1902.00358v2
# http://arxiv.org/abs/math/0211159v1

source: arxiv_api.py

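As mentioned above, arXiv's ID scheme changed in April 2007: IDs from April 2007 onward have the form YYMM.<number>, while earlier ones include the category, in the form <category>/YYMM<number>. If the trailing version suffix such as v1 or v2 is omitted, the latest version appears to be assumed.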
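The two ID forms can be told apart with regular expressions. The sketch below is illustrative only: parse_arxiv_id is a hypothetical helper, and the digit counts (4 or 5 digits after the dot in the new scheme, 3 digits after YYMM in the old one) are assumptions taken from arXiv's identifier documentation rather than from the examples in this article.

```python
import re

# New scheme: YYMM.<number> with an optional version suffix.
NEW_STYLE = re.compile(r'^(?P<yymm>\d{4})\.(?P<number>\d{4,5})(?:v(?P<version>\d+))?$')
# Old scheme: <category>/YYMM<number> with an optional version suffix.
OLD_STYLE = re.compile(
    r'^(?P<category>[a-z-]+(?:\.[A-Z]{2})?)/(?P<yymm>\d{4})(?P<number>\d{3})'
    r'(?:v(?P<version>\d+))?$'
)


def parse_arxiv_id(arxiv_id):
    """Split an arXiv ID into its parts; version is None when omitted."""
    for scheme, pattern in (('new', NEW_STYLE), ('old', OLD_STYLE)):
        match = pattern.match(arxiv_id)
        if match:
            parts = match.groupdict()
            parts['scheme'] = scheme
            return parts
    raise ValueError('not a recognized arXiv ID: {!r}'.format(arxiv_id))


print(parse_arxiv_id('1902.00358v2'))
# {'yymm': '1902', 'number': '00358', 'version': '2', 'scheme': 'new'}

print(parse_arxiv_id('math/0211159v1')['category'])
# math
```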

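Downloading paper PDFs (individually or in bulk)

arxiv provides the function arxiv.download() for downloading a paper's PDF file.

arxiv.py/arxiv.py at master · lukasschwab/arxiv.py

The first argument is the FeedParserDict of an individual paper obtained with arxiv.query(); the second is the directory to save to, given as a path relative to the current directory. If the second argument is omitted, the file is saved in the current directory.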

import arxiv
import time

l = arxiv.query(query='au:"Grisha Perelman"')

arxiv.download(l[0], 'data/temp/')
# 'data/temp/0211159v1.The_entropy_formula_for_the_Ricci_flow_and_its_geometric_applications.pdf'

source: arxiv_download.py

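As in the example above, the default file name is <paper ID>.<paper title>.pdf (with spaces replaced by _).

To use a different file name, pass a function as the third argument. The following example uses only the ID.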

arxiv.download(l[0], 'data/temp/', lambda x: x.get('id').split('/')[-1])
# 'data/temp/0211159v1.pdf'

source: arxiv_download.py
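The naming function does not have to be a lambda. As a further illustration, the sketch below builds a name from the ID plus the first few words of the title. short_name and max_words are hypothetical names, and a plain dict stands in for the FeedParserDict so the snippet runs without network access.

```python
import re


def short_name(entry, max_words=5):
    """Return '<ID>_<first words of the title>' for use as a file name stem."""
    paper_id = entry.get('id').split('/')[-1]
    # Keep only alphanumeric runs so the name is safe on common file systems.
    words = re.findall(r'[A-Za-z0-9]+', entry.get('title', ''))
    return '_'.join([paper_id] + words[:max_words])


# Stand-in for l[0]; only the two keys used by short_name are included.
entry = {
    'id': 'http://arxiv.org/abs/math/0211159v1',
    'title': 'The entropy formula for the Ricci flow and its geometric applications',
}
print(short_name(entry))
# 0211159v1_The_entropy_formula_for_the
```

Following the description of the third argument above, it would be used as arxiv.download(l[0], 'data/temp/', short_name).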

To download all search results in bulk, simply loop over them with a for statement.

for a in l:
    arxiv.download(a, 'data/temp/')
    time.sleep(10)

source: arxiv_download.py

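Here, time.sleep() spaces out the downloads to reduce the load on the server. The example uses 10 seconds, but that is only an example; it does not mean a 10-second interval is guaranteed to be acceptable.

arXiv does not permit indiscriminate automated downloads.

Indiscriminate automated downloads from this site are not permitted.
Robots Beware | arXiv e-print repository

No specific numbers are given, but it is best to avoid downloading a large number of papers in a short time. Spacing requests about as widely as you would when downloading by hand should be fine, but you do so at your own risk.

If you want to download all of arXiv's papers, arXiv provides Bulk Data Access via Amazon S3. It is a requester-pays arrangement, in which the requester (the party downloading) pays the charges.

arXiv Bulk Data Access - Amazon S3 | arXiv e-print repository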

Fetching the arXiv RSS feed

arXiv also provides RSS feeds.

RSS news feeds for arXiv updates | arXiv e-print repository

The URL has the form http://arxiv.org/rss/<category>, such as http://arxiv.org/rss/cs or http://arxiv.org/rss/cs.AI. The list of categories is here:

arXiv API User's Manual - Subject Classifications | arXiv e-print repository

The Python library feedparser makes it easy to process RSS feeds.

import pprint
import feedparser

url = 'http://arxiv.org/rss/cs.CV'

d = feedparser.parse(url)

pprint.pprint(d, depth=1)
# {'bozo': 0,
#  'encoding': 'us-ascii',
#  'entries': [...],
#  'etag': '"Fri, 26 Jul 2019 00:30:00 GMT", "1564101000"',
#  'feed': {...},
#  'headers': {...},
#  'href': 'http://export.arxiv.org/rss/cs.CV',
#  'namespaces': {...},
#  'status': 301,
#  'updated': 'Fri, 26 Jul 2019 00:30:00 GMT',
#  'updated_parsed': time.struct_time(tm_year=2019, tm_mon=7, tm_mday=26, tm_hour=0, tm_min=30, tm_sec=0, tm_wday=4, tm_yday=207, tm_isdst=0),
#  'version': 'rss10'}

source: arxiv_rss.py

entries holds the information for each paper.

print(type(d['entries']))
# <class 'list'>

print(len(d['entries']))
# 67

print(type(d['entries'][0]))
# <class 'feedparser.FeedParserDict'>

pprint.pprint(d['entries'][0], width=100)
# {'author': '<a href="http://arxiv.org/find/cs/1/au:+Kurmi_V/0/1/0/all/0/1">Vinod Kumar Kurmi</a>, '
#            '<a href="http://arxiv.org/find/cs/1/au:+Bajaj_V/0/1/0/all/0/1">Vipul Bajaj</a>, <a '
#            'href="http://arxiv.org/find/cs/1/au:+Subramanian_V/0/1/0/all/0/1">Venkatesh K '
#            'Subramanian</a>, <a '
#            'href="http://arxiv.org/find/cs/1/au:+Namboodiri_V/0/1/0/all/0/1">Vinay P '
#            'Namboodiri</a>',
#  'author_detail': {'name': '<a href="http://arxiv.org/find/cs/1/au:+Kurmi_V/0/1/0/all/0/1">Vinod '
#                            'Kumar Kurmi</a>, <a '
#                            'href="http://arxiv.org/find/cs/1/au:+Bajaj_V/0/1/0/all/0/1">Vipul '
#                            'Bajaj</a>, <a '
#                            'href="http://arxiv.org/find/cs/1/au:+Subramanian_V/0/1/0/all/0/1">Venkatesh '
#                            'K Subramanian</a>, <a '
#                            'href="http://arxiv.org/find/cs/1/au:+Namboodiri_V/0/1/0/all/0/1">Vinay '
#                            'P Namboodiri</a>'},
#  'authors': [{'name': '<a href="http://arxiv.org/find/cs/1/au:+Kurmi_V/0/1/0/all/0/1">Vinod Kumar '
#                       'Kurmi</a>, <a '
#                       'href="http://arxiv.org/find/cs/1/au:+Bajaj_V/0/1/0/all/0/1">Vipul '
#                       'Bajaj</a>, <a '
#                       'href="http://arxiv.org/find/cs/1/au:+Subramanian_V/0/1/0/all/0/1">Venkatesh '
#                       'K Subramanian</a>, <a '
#                       'href="http://arxiv.org/find/cs/1/au:+Namboodiri_V/0/1/0/all/0/1">Vinay P '
#                       'Namboodiri</a>'}],
#  'id': 'http://arxiv.org/abs/1907.10628',
#  'link': 'http://arxiv.org/abs/1907.10628',
#  'links': [{'href': 'http://arxiv.org/abs/1907.10628', 'rel': 'alternate', 'type': 'text/html'}],
#  'summary': '<p>Domain adaptation is essential to enable wide usage of deep learning based\n'
#             'networks trained using large labeled datasets. Adversarial learning based\n'
#             'techniques have shown their utility towards solving this problem using a\n'
#             'discriminator that ensures source and target distributions are close. However,\n'
#             'here we suggest that rather than using a point estimate, it would be useful if\n'
#             'a distribution based discriminator could be used to bridge this gap. This could\n'
#             'be achieved using multiple classifiers or using traditional ensemble methods.\n'
#             'In contrast, we suggest that a Monte Carlo dropout based ensemble discriminator\n'
#             'could suffice to obtain the distribution based discriminator. Specifically, we\n'
#             'propose a curriculum based dropout discriminator that gradually increases the\n'
#             'variance of the sample based distribution and the corresponding reverse\n'
#             'gradients are used to align the source and target feature representations. The\n'
#             'detailed results and thorough ablation analysis show that our model outperforms\n'
#             'state-of-the-art results.\n'
#             '</p>',
#  'summary_detail': {'base': 'http://export.arxiv.org/rss/cs.CV',
#                     'language': None,
#                     'type': 'text/html',
#                     'value': '<p>Domain adaptation is essential to enable wide usage of deep '
#                              'learning based\n'
#                              'networks trained using large labeled datasets. Adversarial learning '
#                              'based\n'
#                              'techniques have shown their utility towards solving this problem '
#                              'using a\n'
#                              'discriminator that ensures source and target distributions are '
#                              'close. However,\n'
#                              'here we suggest that rather than using a point estimate, it would be '
#                              'useful if\n'
#                              'a distribution based discriminator could be used to bridge this gap. '
#                              'This could\n'
#                              'be achieved using multiple classifiers or using traditional ensemble '
#                              'methods.\n'
#                              'In contrast, we suggest that a Monte Carlo dropout based ensemble '
#                              'discriminator\n'
#                              'could suffice to obtain the distribution based discriminator. '
#                              'Specifically, we\n'
#                              'propose a curriculum based dropout discriminator that gradually '
#                              'increases the\n'
#                              'variance of the sample based distribution and the corresponding '
#                              'reverse\n'
#                              'gradients are used to align the source and target feature '
#                              'representations. The\n'
#                              'detailed results and thorough ablation analysis show that our model '
#                              'outperforms\n'
#                              'state-of-the-art results.\n'
#                              '</p>'},
#  'title': 'Curriculum based Dropout Discriminator for Domain Adaptation. (arXiv:1907.10628v1 '
#           '[cs.LG])',
#  'title_detail': {'base': 'http://export.arxiv.org/rss/cs.CV',
#                   'language': None,
#                   'type': 'text/plain',
#                   'value': 'Curriculum based Dropout Discriminator for Domain Adaptation. '
#                            '(arXiv:1907.10628v1 [cs.LG])'}}

print(d['entries'][0]['link'])
# http://arxiv.org/abs/1907.10628

print(d['entries'][0]['title'])
# Curriculum based Dropout Discriminator for Domain Adaptation. (arXiv:1907.10628v1 [cs.LG])

source: arxiv_rss.py

For details on feedparser, see the following article.

Related: Parsing RSS and Atom feeds with feedparser in Python

As the output above shows, the RSS feed contains fewer fields than the API returns, and its values include HTML tags.
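If the HTML tags in RSS values such as author get in the way, they can be stripped with the standard library. The following is a minimal sketch; strip_tags is a hypothetical helper, and the sample string is a shortened copy of the author value shown above.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text found between tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html_text):
    """Return html_text with all tags removed."""
    parser = _TextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)


author = ('<a href="http://arxiv.org/find/cs/1/au:+Kurmi_V/0/1/0/all/0/1">Vinod Kumar Kurmi</a>, '
          '<a href="http://arxiv.org/find/cs/1/au:+Bajaj_V/0/1/0/all/0/1">Vipul Bajaj</a>')

print(strip_tags(author))
# Vinod Kumar Kurmi, Vipul Bajaj

print(strip_tags(author).split(', '))
# ['Vinod Kumar Kurmi', 'Vipul Bajaj']
```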

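If you only need the paper links (URLs) and titles, RSS is enough; if you also want details such as the submission time of the latest papers, use the API. As shown in the arxiv.query() example earlier, specify something like max_results=10, sort_by='submittedDate'. The example is repeated below.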

l = arxiv.query(query='cat:cs.AI', max_results=10, sort_by='submittedDate')

pprint.pprint([[a['id'], a['published']] for a in l])
# [['http://arxiv.org/abs/1907.11184v1', '2019-07-25T16:45:06Z'],
#  ['http://arxiv.org/abs/1907.11112v1', '2019-07-25T14:45:04Z'],
#  ['http://arxiv.org/abs/1907.11021v1', '2019-07-25T13:15:12Z'],
#  ['http://arxiv.org/abs/1907.11007v1', '2019-07-25T12:30:08Z'],
#  ['http://arxiv.org/abs/1907.10953v1', '2019-07-25T10:36:01Z'],
#  ['http://arxiv.org/abs/1907.10952v1', '2019-07-25T10:31:34Z'],
#  ['http://arxiv.org/abs/1907.10925v1', '2019-07-25T09:34:13Z'],
#  ['http://arxiv.org/abs/1907.10914v1', '2019-07-25T09:19:30Z'],
#  ['http://arxiv.org/abs/1907.10772v1', '2019-07-24T23:28:37Z'],
#  ['http://arxiv.org/abs/1907.10761v1', '2019-07-24T22:30:04Z']]

source: arxiv_api.py
