paperfetcher package

Submodules

paperfetcher.apiclients module

Client implementations to communicate with various APIs.

class paperfetcher.apiclients.COCIQuery(components={}, query_params={})

Bases: paperfetcher.apiclients.Query

Class for structuring and executing COCI REST API queries.

Parameters
  • components (collections.OrderedDict) – Components to append to the base URL.

  • query_params (collections.OrderedDict) – Ordered dictionary of query parameters.

components

Components to append to the base URL.

Type

collections.OrderedDict

query_base

Base URL for query (https://opencitations.net/index/coci/api/v{}/…).

Type

str

query_params

Dictionary of query parameters.

Type

collections.OrderedDict

headers

Dictionary of HTTP headers.

Type

dict

response

Response received on executing GET query to the COCI API.

Type

requests.Response

Examples

Querying the references of a paper with a known DOI:

>>> from collections import OrderedDict
>>> from paperfetcher.apiclients import COCIQuery
>>> query = COCIQuery(components=OrderedDict([("references", "10.1021/acs.jpcb.1c02191")]))
>>> query()
>>> query.response
<Response [200]>
>>> query.response.json()
[{'oci': '020010002013610122837192512113701120002010901-0200100000236191212370201090809', 'creation': '2021-05-12', 'timespan': 'P9Y5M19D',
...}]

Querying the citations of a paper with a known DOI:

>>> query = COCIQuery(components=OrderedDict([("citations", "10.1021/acs.jpcb.1c02191")]))
>>> query()
>>> query.response
<Response [200]>
>>> query.response.json()
[{'oci': '0200100000236252421370200020100050206-020010002013610122837192512113701120002010901', 'creation': '2021-10-02', 'timespan': 'P4M20D',
 'journal_sc': 'no', 'author_sc': 'no', 'citing': '10.1002/pol.20210526', 'cited': '10.1021/acs.jpcb.1c02191'}]
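
Each record in the JSON response is a plain dictionary, so the citing DOIs can be collected with ordinary Python. The following is a minimal sketch that works on the record shown above; citing_dois is a hypothetical helper, not part of paperfetcher.

```python
def citing_dois(records):
    """Collect the 'citing' DOI from each COCI citation record."""
    return {record["citing"] for record in records if "citing" in record}


# Record copied from the example response above.
records = [{
    "oci": "0200100000236252421370200020100050206-020010002013610122837192512113701120002010901",
    "creation": "2021-10-02",
    "timespan": "P4M20D",
    "journal_sc": "no",
    "author_sc": "no",
    "citing": "10.1002/pol.20210526",
    "cited": "10.1021/acs.jpcb.1c02191",
}]

print(citing_dois(records))
```
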
class paperfetcher.apiclients.CrossrefQuery(components={}, query_params={})

Bases: paperfetcher.apiclients.Query

Class for structuring and executing Crossref REST API queries.

Query components can be added to the base URL by passing an ordered dictionary to the components argument. For example, the components dictionary {"comp1-key": "comp1-value", "comp2": None} changes the query URL to https://api.crossref.org/comp1-key/comp1-value/comp2/.
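
The mapping from a components dictionary to a URL can be sketched as follows. This is an illustration of the rule described above, not the library's actual implementation; build_query_url is a hypothetical name.

```python
from collections import OrderedDict


def build_query_url(base_url, components):
    """Append each key, and its value unless it is None, as path segments."""
    segments = []
    for key, value in components.items():
        segments.append(str(key))
        if value is not None:
            segments.append(str(value))
    base = base_url.rstrip("/") + "/"
    if not segments:
        return base
    # Keep a trailing slash, matching the example URL in the text.
    return base + "/".join(segments) + "/"


url = build_query_url(
    "https://api.crossref.org/",
    OrderedDict([("comp1-key", "comp1-value"), ("comp2", None)]),
)
print(url)
```

An ordered dictionary matters here because the order of the keys determines the order of the path segments.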

Parameters
  • components (collections.OrderedDict) – Components to append to the base URL.

  • query_params (collections.OrderedDict) – Ordered dictionary of query parameters.

components

Components to append to the base URL.

Type

collections.OrderedDict

query_base

Base URL for query (https://api.crossref.org/…).

Type

str

query_params

Dictionary of query parameters.

Type

collections.OrderedDict

headers

Dictionary of HTTP headers.

Type

dict

response

Response received on executing GET query to the Crossref API.

Type

requests.Response

Examples

Querying the metadata of a paper with a known DOI:

>>> query = CrossrefQuery(components={"works": "10.1021/acs.jpcb.1c02191"})
>>> query()
>>> query.response
<Response [200]>
>>> query.response.json()
{'status': 'ok', 'message-type': 'work', 'message-version': '1.0.0',
'message': {...}}

Query to fetch all articles from a journal with a known ISSN:

>>> components = OrderedDict([("journals", "1520-5126"),
...                           ("works", None)])
>>> query = CrossrefQuery(components)
>>> query()
>>> query.response
<Response [200]>
>>> query.response.json()
{'status': 'ok', 'message-type': 'work', 'message-version': '1.0.0',
'message': {...}}
class paperfetcher.apiclients.Query(base_url=None, query_params: dict = {}, headers: str = {})

Bases: object

Base class for structuring and executing HTTP GET queries.

Parameters
  • base_url (str) – Base URL for query (such as api.xyz.com/get).

  • query_params (dict) – Dictionary of query parameters.

  • headers (dict) – Dictionary of HTTP headers to pass along with the query.

query_base

Base URL for query (such as api.xyz.com/get).

Type

str

query_params

Dictionary of query parameters.

Type

dict

headers

Dictionary of HTTP headers.

Type

dict

response

Response received on executing GET query.

Type

requests.Response

Examples

A simple Query to the GitHub REST API:

>>> query = Query("https://api.github.com")
>>> query()
>>> query.response
<Response [200]>

A Query to the GitHub REST API to fetch a list of all public repositories in the paperfetcher organization:

>>> query = Query("https://api.github.com/orgs/paperfetcher/repos",
...               query_params={"type": "public"},
...               headers={"Accept": "application/vnd.github.v3+json"})
>>> query()
>>> query.response
<Response [200]>
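
For orientation, the query parameters in the second example end up in the query string of the GET request. The sketch below shows that composition with the standard library only; compose_get_url is a hypothetical helper, and how Query builds its request internally is not shown in this reference.

```python
from urllib.parse import urlencode


def compose_get_url(base_url, query_params=None):
    """Return base_url with query_params encoded as a query string."""
    if not query_params:
        return base_url
    return base_url + "?" + urlencode(query_params)


full_url = compose_get_url(
    "https://api.github.com/orgs/paperfetcher/repos",
    {"type": "public"},
)
print(full_url)
```
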
property response

paperfetcher.datastructures module

Custom data structures for paperfetcher.

class paperfetcher.datastructures.CitationsDataset(field_names: tuple, items: list = [])

Bases: paperfetcher.datastructures.Dataset

Stores a tabular dataset of citations, with multiple custom fields.

CitationsDatasets can be exported to pandas DataFrames, and loaded from or saved to disk in text, CSV, or Excel file formats.

Parameters
  • field_names (tuple) – Names (str) of fields.

  • items (list) – List of citations to store (default=[]). Each citation should be an iterable of length len(field_names).

Examples

To create a CitationsDataset object to store the DOI, URL, article title, authors, and date of issue for each citation:

>>> field_names = ["DOI", "URL", "title", "author", "issued"]
>>> data = [["10.xxyy/0.0.0.000001", "https://dx.doi.org/10.xxyy/0.0.0.000001", "A study of A", "P, Q and R", "2020-02-20"],
...         ["10.xxyy/0.0.0.000002", "https://dx.doi.org/10.xxyy/0.0.0.000002", "An investigation into B", "Q, R and P", "2020-03-20"],
...         ["10.xxyy/0.0.0.000003", "https://dx.doi.org/10.xxyy/0.0.0.000003", "The causes of C in D", "R, P and Q", "2020-04-20"],
...         ["10.xxyy/0.0.0.000004", "https://dx.doi.org/10.xxyy/0.0.0.000004", "Characterizing the role of E on F", "P and Q", "2020-05-20"]]
>>> ds = CitationsDataset(field_names, data)

To add a citation to the CitationsDataset:

>>> ds.append(["10.xxyy/0.0.0.000005", "https://dx.doi.org/10.xxyy/0.0.0.000005", "The G effect", "Q and P", "2020-06-20"])

To export the CitationsDataset object to a pandas DataFrame:

>>> df = ds.to_df()
>>> df
                    DOI                                      URL                              title      author      issued
0  10.xxyy/0.0.0.000001  https://dx.doi.org/10.xxyy/0.0.0.000001                       A study of A  P, Q and R  2020-02-20
1  10.xxyy/0.0.0.000002  https://dx.doi.org/10.xxyy/0.0.0.000002            An investigation into B  Q, R and P  2020-03-20
2  10.xxyy/0.0.0.000003  https://dx.doi.org/10.xxyy/0.0.0.000003               The causes of C in D  R, P and Q  2020-04-20
3  10.xxyy/0.0.0.000004  https://dx.doi.org/10.xxyy/0.0.0.000004  Characterizing the role of E on F     P and Q  2020-05-20
4  10.xxyy/0.0.0.000005  https://dx.doi.org/10.xxyy/0.0.0.000005                       The G effect     Q and P  2020-06-20

To save data to disk:

>>> ds.save_txt("cits.txt")
>>> ds.save_csv("cits.csv")
>>> ds.save_excel("cits.xlsx")
append(item)

Adds a citation to the dataset.

extend(items)

Adds each citation from a list of citations (i.e. each inner list of a nested list) to the dataset.

save_csv(file)

Saves dataset to .csv file. The first row of the CSV file contains field names.
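
The layout described above (field names in the first row, one citation per following row) can be reproduced with the standard csv module. This is a sketch of the file layout only, written to an in-memory buffer; it does not show how save_csv is implemented.

```python
import csv
import io

field_names = ("DOI", "title")
items = [
    ("10.xxyy/0.0.0.000001", "A study of A"),
    ("10.xxyy/0.0.0.000002", "An investigation into B"),
]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(field_names)  # first row: field names
writer.writerows(items)       # one row per citation
csv_text = buffer.getvalue()
print(csv_text)
```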

save_excel(file)

Saves dataset to Excel file (uses Pandas). The first row of the Excel file contains field names.

save_txt(file)

Saves dataset to .txt file.

to_df()

Converts dataset to DataFrame.

class paperfetcher.datastructures.DOIDataset(items: list = [])

Bases: paperfetcher.datastructures.Dataset

Stores a dataset of DOIs.

DOIDatasets can be exported to pandas DataFrames, and loaded from or saved to disk in text, CSV, or Excel file formats.

Parameters

items (list) – List of DOIs (str) to store (default=[]).

Examples

To create a DOIDataset object from a list of DOIs:

>>> ds = DOIDataset(["x1.y1.z1/123123", "x2.y2.z2/456456"])

To add a DOI to the DOIDataset object:

>>> ds.append("x3.y3.z3/789789")

To export the DOIDataset object to a pandas DataFrame:

>>> df = ds.to_df()
>>> df
               DOI
0  x1.y1.z1/123123
1  x2.y2.z2/456456
2  x3.y3.z3/789789

To save data to disk:

>>> ds.save_txt("dois.txt")
>>> ds.save_csv("dois.csv")
>>> ds.save_excel("dois.xlsx")
extend_dataset(ds: paperfetcher.datastructures.DOIDataset)

Appends all items from DOIDataset ds to the end of the current dataset.

save_csv(file)

Saves dataset to .csv file.

save_excel(file)

Saves dataset to Excel file.

save_txt(file)

Saves dataset to .txt file.

to_df()

Converts dataset to DataFrame.

to_txt_string()

Returns a string which can be written to a .txt file.

class paperfetcher.datastructures.Dataset(items: list = [])

Bases: object

Abstract interface that defines functions for child Dataset classes to implement.

Datasets are designed to store [usually tabular] data (as input to or output from paperfetcher searches), export data to pandas DataFrames, and load/save data to disk using common data formats (txt, csv, xlsx).

Parameters

items (iterable) – Items to store in dataset (default=[]).

append(item)

Adds an item to the dataset.

extend(items: list)

Adds each item from a list of items to the dataset.

classmethod from_csv(file)

Loads dataset from .csv file.

classmethod from_excel(file)

Loads dataset from Excel file.

classmethod from_txt(file)

Loads dataset from .txt file.

save_csv(file)

Saves dataset to .csv file.

save_excel(file)

Saves dataset to Excel file.

save_txt(file)

Saves dataset to .txt file.

to_df()

Converts dataset to DataFrame.

class paperfetcher.datastructures.HeadlessRISWriter(*, mapping: Optional[Dict] = None, list_tags: Optional[List[str]] = None, ignore: Optional[List[str]] = None, skip_unknown_tags: bool = False, enforce_list_tags: bool = True)

Bases: rispy.writer.BaseWriter

DEFAULT_LIST_TAGS: List[str] = ['A1', 'A2', 'A3', 'A4', 'AU', 'KW', 'N1']
DEFAULT_MAPPING: Dict = {'A1': 'first_authors', 'A2': 'secondary_authors', 'A3': 'tertiary_authors', 'A4': 'subsidiary_authors', 'AB': 'abstract', 'AD': 'author_address', 'AN': 'accession_number', 'AU': 'authors', 'C1': 'custom1', 'C2': 'custom2', 'C3': 'custom3', 'C4': 'custom4', 'C5': 'custom5', 'C6': 'custom6', 'C7': 'custom7', 'C8': 'custom8', 'CA': 'caption', 'CN': 'call_number', 'CY': 'place_published', 'DA': 'date', 'DB': 'name_of_database', 'DO': 'doi', 'DP': 'database_provider', 'EP': 'end_page', 'ER': 'end_of_reference', 'ET': 'edition', 'ID': 'id', 'IS': 'number', 'J2': 'alternate_title1', 'JA': 'alternate_title2', 'JF': 'alternate_title3', 'JO': 'journal_name', 'KW': 'keywords', 'L1': 'file_attachments1', 'L2': 'file_attachments2', 'L4': 'figure', 'LA': 'language', 'LB': 'label', 'M1': 'note', 'M3': 'type_of_work', 'N1': 'notes', 'N2': 'notes_abstract', 'NV': 'number_of_volumes', 'OP': 'original_publication', 'PB': 'publisher', 'PY': 'year', 'RI': 'reviewed_item', 'RN': 'research_notes', 'RP': 'reprint_edition', 'SE': 'section', 'SN': 'issn', 'SP': 'start_page', 'ST': 'short_title', 'T1': 'primary_title', 'T2': 'secondary_title', 'T3': 'tertiary_title', 'TA': 'translated_author', 'TI': 'title', 'TT': 'translated_title', 'TY': 'type_of_reference', 'UK': 'unknown_tag', 'UR': 'url', 'VL': 'volume', 'Y1': 'publication_year', 'Y2': 'access_date'}
PATTERN: str = '{tag}  - {value}'
START_TAG: str = 'TY'
class paperfetcher.datastructures.RISDataset(items: list = [])

Bases: paperfetcher.datastructures.Dataset

Stores a dataset of RIS items. RIS items are rispy-readable dictionaries.

An RISDataset can be created from an RIS-formatted string, or from an RIS file. An RISDataset can be written to an RIS-formatted string, or to an RIS file. Individual item dictionaries in an RISDataset can be modified to add new tags or change the values of existing tags.
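
To make the relationship between an item dictionary and RIS text concrete, the sketch below formats one dictionary using the PATTERN string and a few entries of DEFAULT_MAPPING listed for HeadlessRISWriter. It is an illustration only: to_ris_lines, FIELD_TO_TAG, and LIST_FIELDS are hypothetical names, and the real writer in rispy handles many more cases.

```python
PATTERN = "{tag}  - {value}"  # same pattern string as listed for HeadlessRISWriter

# A few entries of the tag mapping, inverted so field names look up tags.
FIELD_TO_TAG = {
    "type_of_reference": "TY",
    "authors": "AU",
    "primary_title": "T1",
    "doi": "DO",
}
LIST_FIELDS = {"authors"}  # fields whose values are lists (one line per value)


def to_ris_lines(entry):
    """Format one item dictionary as a list of RIS lines."""
    # The entry starts with the TY tag (START_TAG).
    lines = [PATTERN.format(tag="TY", value=entry["type_of_reference"])]
    for field, value in entry.items():
        if field == "type_of_reference":
            continue
        tag = FIELD_TO_TAG[field]
        values = value if field in LIST_FIELDS else [value]
        lines.extend(PATTERN.format(tag=tag, value=v) for v in values)
    # The entry ends with the ER (end_of_reference) tag.
    lines.append(PATTERN.format(tag="ER", value=""))
    return lines


entry = {
    "type_of_reference": "JOUR",
    "authors": ["P", "Q"],
    "primary_title": "A study of A",
    "doi": "10.xxyy/0.0.0.000001",
}
ris_lines = to_ris_lines(entry)
print("\n".join(ris_lines))
```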

Parameters

items (list) – List of citations to store (default=[]). Each citation should be a rispy-readable dictionary.

Examples

To create an RISDataset from a list of rispy-readable dictionaries (see rispy doc on GitHub for details):

>>> dict_list = [{'journal_name': ...}, {'journal_name': ...}, ...]
>>> ds = RISDataset(dict_list)

To load an RISDataset from an RIS-formatted string:

>>> ds = RISDataset.from_ris_string(ris_string)

To load an RISDataset from an RIS file:

>>> ds = RISDataset.from_ris(ris_file)

extend_dataset(ds: paperfetcher.datastructures.RISDataset)

Appends all items from RISDataset ds to the end of the current dataset.

classmethod from_ris(file)

Loads dataset from RIS file.

classmethod from_ris_string(ris_string)

Loads dataset from RIS-formatted string.

save_ris(filename, headers=False)

Saves dataset to .ris file.

Parameters
  • filename – Path to file to write RIS data to.

  • headers (bool, default=False) – If set to true, writes reference number before each RIS entry.

to_ris_string(headers=False)

Returns a string which can be written to .ris file.

Parameters

headers (bool, default=False) – If set to true, writes reference number before each RIS entry.

paperfetcher.exceptions module

Definitions of all Exceptions raised by paperfetcher.

exception paperfetcher.exceptions.ContentNegotiationError

Bases: Exception

Exception raised when content negotiation fails.

exception paperfetcher.exceptions.DatasetError

Bases: Exception

Exception raised when an incorrect operation is performed on a dataset.

exception paperfetcher.exceptions.QueryError

Bases: Exception

Exception raised when query fails.

exception paperfetcher.exceptions.RISParsingError

Bases: Exception

Exception raised when RIS parsing fails.

exception paperfetcher.exceptions.SearchError

Bases: Exception

Exception raised when search fails.

paperfetcher.handsearch module

Classes to fetch all journal works (articles) matching a set of keywords and within a given date range by querying various APIs.

class paperfetcher.handsearch.CrossrefSearch(ISSN='', type='journal-article', keyword_list=None, from_date=None, until_date=None, batch_size=20, sort_order='desc')

Bases: object

Retrieves all works from a journal, given its ISSN, which match a set of keywords and are within a date range.

Calling a CrossrefSearch object performs the search. A search object can be called with the arguments display_progress_bar (True/False; default=True) to toggle the display of a search progress bar, select (True/False; default=False), and select_fields (list) to query only a subset of metadata for each journal article.

If select is False, a full (memory and time intensive) search is performed, fetching all metadata associated with each journal work.

If select is True, a subset of fields to fetch can be specified using the select_fields parameter. Check the Crossref REST API doc for details on which field names are permissible.

Note

Performing a search with no keywords and select=False can be very time- and memory-intensive. The search object will complain when such a search is performed.

Parameters
  • ISSN (str) – Journal (web) ISSN.

  • type (str) – Type of works to fetch (default="journal-article").

  • keyword_list (list) – List of keywords (str) to query with (default=None).

  • from_date (str) – Fetch articles published from (and after) this date (format="YYYY-MM-DD", default=None).

  • until_date (str) – Fetch articles published until this date (format="YYYY-MM-DD", default=None).

  • batch_size (int) – Number of works to fetch in each batch (default=20).

  • sort_order (str) – Order in which to sort works by date ("asc" or "desc", default="desc").
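
As a rough guide to what these parameters mean on the Crossref side, the sketch below translates them into Crossref REST API query parameters (filter, query, rows, sort, order). The mapping is an assumption made for illustration: crossref_search_params is a hypothetical helper, and the exact parameters that CrossrefSearch sends are not documented in this reference.

```python
def crossref_search_params(keyword_list=None, work_type="journal-article",
                           from_date=None, until_date=None,
                           batch_size=20, sort_order="desc"):
    """Build a dictionary of Crossref REST API query parameters (assumed mapping)."""
    filters = ["type:" + work_type]
    if from_date:
        filters.append("from-pub-date:" + from_date)
    if until_date:
        filters.append("until-pub-date:" + until_date)
    params = {
        "filter": ",".join(filters),
        "rows": batch_size,    # works per batch
        "sort": "issued",      # sort by date of issue
        "order": sort_order,   # "asc" or "desc"
    }
    if keyword_list:
        # Assumption: keywords are combined into one free-text query.
        params["query"] = " ".join(keyword_list)
    return params


params = crossref_search_params(["hydration"], from_date="2018-01-01",
                                until_date="2020-01-01")
print(params)
```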

ISSN

Journal (web) ISSN.

Type

str

type

Type of works to fetch (default="journal-article").

Type

str

keyword_list

List of keywords (str) to query with.

Type

list

from_date

Fetch articles published from (and after) this date (format="YYYY-MM-DD").

Type

str

until_date

Fetch articles published until this date (format="YYYY-MM-DD").

Type

str

batch_size

Number of works to fetch in each batch (default=20).

Type

int

sort_order

Order in which to sort works by date ("asc" or "desc", default="desc").

Type

str

results

List of dictionaries, each dictionary corresponds to a work.

Type

list

Examples

>>> search = CrossrefSearch(ISSN="1520-5126", keyword_list=["hydration"], from_date="2018-01-01", until_date="2020-01-01")
>>> search()
>>> len(search)
13
>>> ds = search.get_DOIDataset()
>>> ds.to_df()
                     DOI
0   10.1021/jacs.9b09103
1   10.1021/jacs.9b06862
2   10.1021/jacs.9b09111
3   10.1021/jacs.9b05874
4   10.1021/jacs.9b02820
5   10.1021/jacs.9b05136
6   10.1021/jacs.9b02742
7   10.1021/jacs.9b00577
8   10.1021/jacs.8b11448
9   10.1021/jacs.8b12877
10  10.1021/jacs.8b11667
11  10.1021/jacs.8b08298
12  10.1021/jacs.7b11537
dry_run(select=False, select_fields=[])

Reports how many works this search would fetch.

get_CitationsDataset(field_list=[], field_parsers_list=[])

Parses a selection of fields from search results and returns them as a CitationsDataset object.

Parameters
  • field_list (list) – Names of fields to parse (see Crossref REST API doc for permissible field name values).

  • field_parsers_list (list) – List of field parser functions corresponding to each field name. A None value means that no parser is needed for that field.

Returns

CitationsDataset

Example

>>> from paperfetcher import handsearch, parsers
>>> search = handsearch.CrossrefSearch(ISSN="1520-5126", keyword_list=["hydration"], from_date="2018-01-01",
...                                    until_date="2020-01-01")
>>> search(select=True, select_fields=['DOI', 'URL', 'title', 'author', 'issued'])
>>> ds = search.get_CitationsDataset(field_list=['DOI', 'URL', 'title', 'author', 'issued'],
...                                  field_parsers_list=[None, None, parsers.crossref_title_parser,
...                                                      parsers.crossref_authors_parser, parsers.crossref_date_parser])
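
The pairing of field_list with field_parsers_list can be pictured as applying one parser per field, with None meaning the raw value is kept. The sketch below shows that idea on a single work dictionary; parse_fields and first_title are hypothetical stand-ins, not functions from paperfetcher.

```python
def parse_fields(work, field_list, field_parsers_list):
    """Build one row by applying each field's parser (None keeps the raw value)."""
    row = []
    for field, parser in zip(field_list, field_parsers_list):
        value = work.get(field)
        row.append(value if parser is None else parser(value))
    return row


def first_title(title):
    """Hypothetical parser: take the first entry of a list of titles."""
    return title[0] if title else ""


work = {"DOI": "10.xxyy/0.0.0.000001", "title": ["A study of A"]}
row = parse_fields(work, ["DOI", "title"], [None, first_title])
print(row)
```
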
get_DOIDataset()

Extracts DOIs from search results and returns them as a DOIDataset object.

Returns

DOIDataset

get_RISDataset(extra_field_list=[], extra_field_parser_list=[], extra_field_rispy_tags=[])

Extracts DOIs from search results and fetches RIS data for each DOI using Crossref’s content negotiation service.

Extra fields in the search results that are not automatically populated by Crossref’s content negotiation service can be mapped to the RIS format (through rispy’s mapping) using the extra_field_list, extra_field_parser_list, and extra_field_rispy_tags arguments.

Parameters
  • extra_field_list (list) – List of extra fields to parse and include in RIS file (see Crossref REST API doc for permissible field name values).

  • extra_field_parser_list (list) – List of field parser functions corresponding to each extra field name. A None value means that no parser is needed for that field.

  • extra_field_rispy_tags (list) – List of rispy tags for each extra field.
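
Content negotiation means asking a DOI resolver for a particular representation of a work by setting the HTTP Accept header. The sketch below builds (but does not send) such a request for RIS data using the standard library. It assumes the public https://doi.org/ resolver and the application/x-research-info-systems media type; ris_request is a hypothetical helper, and the endpoint paperfetcher actually uses is not stated in this reference.

```python
import urllib.request


def ris_request(doi):
    """Build a GET request asking for the RIS representation of a DOI."""
    return urllib.request.Request(
        "https://doi.org/" + doi,
        headers={"Accept": "application/x-research-info-systems"},
    )


req = ris_request("10.1021/acs.jpcb.1c02191")
print(req.full_url)
print(req.get_header("Accept"))
```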

paperfetcher.parsers module

Functions to parse data returned from queries.

New parsers can be added here.

paperfetcher.parsers.crossref_authors_parser(author_array)

Function to parse authors.

Returns

str

paperfetcher.parsers.crossref_date_parser(date)

Function to parse date.

Returns

str

paperfetcher.parsers.crossref_title_parser(title)

Function to parse title.

Returns

str
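
For readers who want to add a parser, the sketch below shows what parsers of this kind might look like. It assumes the usual shape of Crossref metadata (title as a list of strings, author as a list of dictionaries with given and family keys, issued as a dictionary with a date-parts list). The function names and output formats are hypothetical and may differ from the parsers shipped with paperfetcher.

```python
def title_to_str(title):
    """Join a list of title strings into one string."""
    return " ".join(title) if title else ""


def authors_to_str(author_array):
    """Format a list of author dictionaries as 'Given Family, Given Family'."""
    names = []
    for author in author_array or []:
        parts = (author.get("given"), author.get("family"))
        names.append(" ".join(part for part in parts if part))
    return ", ".join(names)


def date_to_str(date):
    """Format {'date-parts': [[year, month, day]]} as 'YYYY-MM-DD'."""
    parts = date["date-parts"][0]
    return "-".join(str(part).zfill(2) for part in parts)


print(title_to_str(["A study of A"]))
print(authors_to_str([{"given": "P", "family": "Q"}, {"family": "R"}]))
print(date_to_str({"date-parts": [[2020, 2, 20]]}))
```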

paperfetcher.snowballsearch module

Classes to fetch all journal articles in the references of (i.e. backward search) or citing (i.e. forward search) a set of journal articles.

For backward search, you can use either Crossref or COCI (should be equivalent). For forward search, you can only use COCI at the moment.
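
The difference between the two directions can be illustrated on a small in-memory citation graph, with no API calls. In this sketch REFERENCES is toy data, and backward_search and forward_search are hypothetical helpers rather than paperfetcher functions.

```python
# Toy citation graph: each key cites every DOI in its value set.
REFERENCES = {
    "A": {"B", "C"},
    "D": {"A", "C"},
    "E": {"A"},
}


def backward_search(search_dois, references):
    """Return everything referenced by the given DOIs."""
    result = set()
    for doi in search_dois:
        result |= references.get(doi, set())
    return result


def forward_search(search_dois, references):
    """Return everything that cites at least one of the given DOIs."""
    targets = set(search_dois)
    return {citing for citing, cited in references.items() if cited & targets}


print(backward_search(["A", "D"], REFERENCES))
print(forward_search(["A"], REFERENCES))
```

Both functions return sets, so a DOI reached through several starting articles appears only once, in line with the result_dois attributes described below.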

class paperfetcher.snowballsearch.COCIBackwardReferenceSearch(search_dois: list)

Bases: object

Retrieves the (DOIs of) all articles in the references of a list of (DOIs of) articles by using the COCI REST API.

Parameters

search_dois (list) – List of DOIs (str) to fetch references of.

search_dois

List of DOIs (str) to fetch references of.

Type

list

result_dois

Set of DOIs which are referenced by the DOIs in search_dois.

Type

set

Example

>>> search = snowballsearch.COCIBackwardReferenceSearch(["10.1021/acs.jpcb.1c02191", "10.1073/pnas.2018234118"])
>>> search()
>>> len(search)
140
>>> search.result_dois
{'10.1021/jp972543+', '10.1073/pnas.0708088105',  ... ,  '10.1073/pnas.0705830104'}
classmethod from_DOIDataset(search_dataset: paperfetcher.datastructures.DOIDataset)

Constructs a search object from a DOIDataset.

Parameters

search_dataset (DOIDataset) – Dataset of DOIs to fetch references of.

Returns

COCIBackwardReferenceSearch

get_DOIDataset()

Returns search results as a DOIDataset object.

Returns

DOIDataset

get_RISDataset()

Returns search results as an RISDataset object. Uses the Crossref REST API for content negotiation.

Returns

RISDataset

class paperfetcher.snowballsearch.COCIForwardCitationSearch(search_dois: list)

Bases: object

Retrieves the (DOIs of) all articles citing a list of (DOIs of) articles by using the COCI REST API.

Parameters

search_dois (list) – List of DOIs (str) to fetch citations of.

search_dois

List of DOIs (str) to fetch citations of.

Type

list

result_dois

Set of DOIs which cite the DOIs in search_dois.

Type

set

Example

>>> search = snowballsearch.COCIForwardCitationSearch(["10.1021/acs.jpcb.8b11423", "10.1073/pnas.2018234118"])
>>> search()
>>> len(search)
11
>>> search.result_dois
{'10.1039/c9sc02097g', '10.1021/acs.jpcb.1c05748', ... , '10.1021/acs.jpclett.9b02052'}
classmethod from_DOIDataset(search_dataset: paperfetcher.datastructures.DOIDataset)

Constructs a search object from a DOIDataset.

Parameters

search_dataset (DOIDataset) – Dataset of DOIs to fetch citations of.

Returns

COCIForwardCitationSearch

get_DOIDataset()

Returns search results as a DOIDataset object.

Returns

DOIDataset

get_RISDataset()

Returns search results as an RISDataset object. Uses the Crossref REST API for content negotiation.

Returns

RISDataset

class paperfetcher.snowballsearch.CrossrefBackwardReferenceSearch(search_dois: list)

Bases: object

Retrieves (the DOIs of) all articles in the references of a list of (DOIs of) articles by using the Crossref REST API.

Parameters

search_dois (list) – List of DOIs (str) to fetch references of.

search_dois

List of DOIs (str) to fetch references of.

Type

list

result_dois

Set of DOIs which are referenced by the DOIs in search_dois.

Type

set

Example

>>> search = snowballsearch.CrossrefBackwardReferenceSearch(["10.1021/acs.jpcb.1c02191", "10.1073/pnas.2018234118"])
>>> search()
>>> len(search)
140
>>> search.result_dois
{'10.1021/jp972543+', '10.1073/pnas.0708088105',  ... ,  '10.1073/pnas.0705830104'}
classmethod from_DOIDataset(search_dataset: paperfetcher.datastructures.DOIDataset)

Constructs a search object from a DOIDataset.

Parameters

search_dataset (DOIDataset) – Dataset of DOIs to fetch references of.

Returns

CrossrefBackwardReferenceSearch

get_DOIDataset()

Returns search results as a DOIDataset object.

Returns

DOIDataset

get_RISDataset()

Returns search results as an RISDataset object. Uses the Crossref REST API for content negotiation.

Returns

RISDataset

Module contents

class paperfetcher.GlobalConfig

Bases: object

crossref_plus = False
crossref_plus_auth_token = ''
crossref_useragent = 'paperfetcher/0.0.1 (https://github.com/paperfetcher/paperfetcher; mailto:pallathakash@gmail.com)'
loglevel = 20
streamlit = False
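
The default loglevel of 20 corresponds to logging.INFO in Python's standard logging module. The short sketch below checks this with the standard library only; the logger name is arbitrary, and how paperfetcher applies GlobalConfig.loglevel to its own loggers is not shown here.

```python
import logging

level = 20  # same value as the default GlobalConfig.loglevel
print(logging.getLevelName(level))

# A logger set to this level emits INFO and above, but not DEBUG.
logger = logging.getLogger("loglevel-example")
logger.setLevel(level)
print(logger.isEnabledFor(logging.INFO))
print(logger.isEnabledFor(logging.DEBUG))
```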