paperfetcher package

Submodules

paperfetcher.apiclients module

Client implementations to communicate with various APIs.

class paperfetcher.apiclients.COCIQuery(components={}, query_params={})

Bases: paperfetcher.apiclients.Query

Class for structuring and executing COCI REST API queries.

Parameters

components (collections.OrderedDict) – Components to append to the base URL.
query_params (collections.OrderedDict) – Ordered dictionary of query parameters.

components

Components to append to the base URL.

Type: collections.OrderedDict

query_base

Base URL for query (https://opencitations.net/index/coci/api/v{}/…).

Type: str

query_params

Dictionary of query parameters.

Type: collections.OrderedDict

headers

Dictionary of HTTP headers.

Type: dict

response

Response received on executing a GET query to the COCI API.

Type: requests.Response

Examples

Querying the references of a paper with a known DOI:

>>> query = COCIQuery(components=OrderedDict([("references", "10.1021/acs.jpcb.1c02191")]))
>>> query()
>>> query.response
<Response [200]>
>>> query.response.json()
[{'oci': '020010002013610122837192512113701120002010901-0200100000236191212370201090809', 'creation': '2021-05-12', 'timespan': 'P9Y5M19D',
...}]

Querying the citations of a paper with a known DOI:

>>> query = COCIQuery(components=OrderedDict([("citations", "10.1021/acs.jpcb.1c02191")]))
>>> query()
>>> query.response
<Response [200]>
>>> query.response.json()
[{'oci': '0200100000236252421370200020100050206-020010002013610122837192512113701120002010901', 'creation': '2021-10-02', 'timespan': 'P4M20D',
 'journal_sc': 'no', 'author_sc': 'no', 'citing': '10.1002/pol.20210526', 'cited': '10.1021/acs.jpcb.1c02191'}]

class paperfetcher.apiclients.CrossrefQuery(components={}, query_params={})

Bases: paperfetcher.apiclients.Query

Class for structuring and executing Crossref REST API queries.

Query components can be added to the base URL by passing an ordered dictionary to the components argument. For example, the components dictionary {"comp1-key": "comp1-value", "comp2": None} changes the query URL to https://api.crossref.org/comp1-key/comp1-value/comp2/.
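As a rough illustration of the mapping described above, the sketch below joins an ordered dictionary of components onto a base URL. build_query_url is a hypothetical helper written for this example, not paperfetcher's own code; it assumes only that keys and non-None values are appended in order, separated by slashes.

```python
from collections import OrderedDict


def build_query_url(base_url, components):
    """Append each key (and its value, unless None) to base_url, in order.

    Hypothetical helper for illustration; not part of paperfetcher.
    """
    parts = []
    for key, value in components.items():
        parts.append(str(key))
        if value is not None:
            parts.append(str(value))
    # Join with slashes and keep a trailing slash, as in the URL quoted above.
    return "/".join([base_url.rstrip("/")] + parts) + "/"


components = OrderedDict([("comp1-key", "comp1-value"), ("comp2", None)])
url = build_query_url("https://api.crossref.org", components)
print(url)
```

Because the components are ordered, swapping two entries produces a different path, which is why an OrderedDict is the documented argument type.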

Parameters

components (collections.OrderedDict) – Components to append to the base URL.
query_params (collections.OrderedDict) – Ordered dictionary of query parameters.

components

Components to append to the base URL.

Type: collections.OrderedDict

query_base

Base URL for query (https://api.crossref.org/…).

Type: str

query_params

Dictionary of query parameters.

Type: collections.OrderedDict

headers

Dictionary of HTTP headers.

Type: dict

response

Response received on executing a GET query to the Crossref API.

Type: requests.Response

Examples

Querying the metadata of a paper with a known DOI:

>>> query = CrossrefQuery(components={"works": "10.1021/acs.jpcb.1c02191"})
>>> query()
>>> query.response
<Response [200]>
>>> query.response.json()
{'status': 'ok', 'message-type': 'work', 'message-version': '1.0.0',
'message': {...}}

Query to fetch all articles from a journal with a known ISSN:

>>> components = OrderedDict([("journals", "1520-5126"),
...                           ("works", None)])
>>> query = CrossrefQuery(components)
>>> query()
>>> query.response
<Response [200]>
>>> query.response.json()
{'status': 'ok', 'message-type': 'work-list', 'message-version': '1.0.0',
'message': {...}}

class paperfetcher.apiclients.Query(base_url=None, query_params: dict = {}, headers: str = {})

Bases: object

Base class for structuring and executing HTTP GET queries.

Parameters

base_url (str) – Base URL for query (such as api.xyz.com/get).
query_params (dict) – Dictionary of query parameters.
headers (dict) – Dictionary of HTTP headers to pass along with the query.

query_base

Base URL for query (such as api.xyz.com/get).

Type: str

query_params

Dictionary of query parameters.

Type: dict

headers

Dictionary of HTTP headers.

Type: dict

response

Response received on executing a GET query.

Type: requests.Response

Examples

A simple Query to the GitHub REST API:

>>> query = Query("https://api.github.com")
>>> query()
>>> query.response
<Response [200]>

A Query to the GitHub REST API to fetch a list of all public repositories in the paperfetcher organization:

>>> query = Query("https://api.github.com/orgs/paperfetcher/repos",
...               query_params={"type": "public"},
...               headers={"Accept": "application/vnd.github.v3+json"})
>>> query()
>>> query.response
<Response [200]>

property response
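To make the roles of base_url, query_params, and headers in the Query class concrete, here is a minimal sketch that assembles (but does not send) an equivalent GET request using only the standard library. prepare_get is a hypothetical helper; the Query class itself exposes a requests.Response, so its internals may well differ.

```python
from urllib.parse import urlencode
from urllib.request import Request


def prepare_get(base_url, query_params=None, headers=None):
    """Build (but do not send) a GET request from a URL, parameters and headers.

    Hypothetical stand-in for illustration; not paperfetcher's implementation.
    """
    query_string = urlencode(query_params or {})
    url = f"{base_url}?{query_string}" if query_string else base_url
    return Request(url, headers=headers or {}, method="GET")


request = prepare_get(
    "https://api.github.com/orgs/paperfetcher/repos",
    query_params={"type": "public"},
    headers={"Accept": "application/vnd.github.v3+json"},
)
print(request.full_url)              # query parameters are URL-encoded
print(request.get_header("Accept"))  # headers travel alongside the URL
```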

paperfetcher.datastructures module

Custom data structures for paperfetcher.

class paperfetcher.datastructures.CitationsDataset(field_names: tuple, items: list = [])

Bases: paperfetcher.datastructures.Dataset

Stores a tabular dataset of citations, with multiple custom fields.

CitationsDatasets can be exported to pandas DataFrames, and loaded from or saved to disk in text, CSV, or Excel file formats.

Parameters

field_names (tuple) – Names (str) of fields.
items (list) – List of citations to store (default=[]). Each citation should be an iterable of length len(field_names).
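The length requirement on each citation can be checked before building a dataset. The sketch below is one way to do so; check_items is a hypothetical helper, and raising ValueError is an assumption, since how CitationsDataset itself reacts to a mismatched row is not documented here.

```python
def check_items(field_names, items):
    """Return items unchanged if every citation has one value per field.

    Hypothetical pre-check; the ValueError is an assumption of this sketch.
    """
    expected = len(field_names)
    for index, item in enumerate(items):
        if len(item) != expected:
            raise ValueError(
                f"item {index} has {len(item)} values, expected {expected}"
            )
    return items


field_names = ("DOI", "title")
good_rows = check_items(field_names, [["10.xxyy/0.0.0.000001", "A study of A"]])

try:
    check_items(field_names, [["10.xxyy/0.0.0.000002"]])  # title is missing
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```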

Examples

To create a CitationsDataset object to store the DOI, URL, article title, authors, and date of issue for each citation:

>>> field_names = ["DOI", "URL", "title", "author", "issued"]
>>> data = [["10.xxyy/0.0.0.000001", "https://dx.doi.org/10.xxyy/0.0.0.000001", "A study of A", "P, Q and R", "2020-02-20"],
...         ["10.xxyy/0.0.0.000002", "https://dx.doi.org/10.xxyy/0.0.0.000002", "An investigation into B", "Q, R and P", "2020-03-20"],
...         ["10.xxyy/0.0.0.000003", "https://dx.doi.org/10.xxyy/0.0.0.000003", "The causes of C in D", "R, P and Q", "2020-04-20"],
...         ["10.xxyy/0.0.0.000004", "https://dx.doi.org/10.xxyy/0.0.0.000004", "Characterizing the role of E on F", "P and Q", "2020-05-20"]]
>>> ds = CitationsDataset(field_names, data)

To add a citation to the CitationsDataset:

>>> ds.append(["10.xxyy/0.0.0.000005", "https://dx.doi.org/10.xxyy/0.0.0.000005", "The G effect", "Q and P", "2020-06-20"])

To export the CitationsDataset object to a pandas DataFrame:

>>> df = ds.to_df()
>>> df
                    DOI                                      URL                              title      author      issued
0  10.xxyy/0.0.0.000001  https://dx.doi.org/10.xxyy/0.0.0.000001                       A study of A  P, Q and R  2020-02-20
1  10.xxyy/0.0.0.000002  https://dx.doi.org/10.xxyy/0.0.0.000002            An investigation into B  Q, R and P  2020-03-20
2  10.xxyy/0.0.0.000003  https://dx.doi.org/10.xxyy/0.0.0.000003               The causes of C in D  R, P and Q  2020-04-20
3  10.xxyy/0.0.0.000004  https://dx.doi.org/10.xxyy/0.0.0.000004  Characterizing the role of E on F     P and Q  2020-05-20
4  10.xxyy/0.0.0.000005  https://dx.doi.org/10.xxyy/0.0.0.000005                       The G effect     Q and P  2020-06-20

To save data to disk:

>>> ds.save_txt("cits.txt")
>>> ds.save_csv("cits.csv")
>>> ds.save_excel("cits.xlsx")

append(item): Adds a citation to the dataset.

extend(items): Adds each citation from a list of citations (i.e. eacher inner list of nested list) to the dataset.

save_csv(file): Saves dataset to .csv file. The first row of the CSV file contains field names.

save_excel(file): Saves dataset to Excel file (uses Pandas). The first row of the Excel file contains field names.

save_txt(file): Saves dataset to .txt file.

to_df(): Converts dataset to DataFrame.

class paperfetcher.datastructures.DOIDataset(items: list = [])

Bases: paperfetcher.datastructures.Dataset

Stores a dataset of DOIs.

DOIDatasets can be exported to pandas DataFrames, and loaded from or saved to disk in text, CSV, or Excel file formats.

Parameters: items (list) – List of DOIs (str) to store (default=[]).

Examples

To create a DOIDataset object from a list of DOIs:

>>> ds = DOIDataset(["x1.y1.z1/123123", "x2.y2.z2/456456"])

To add a DOI to the DOIDataset object:

>>> ds.append("x3.y3.z3/789789")

To export the DOIDataset object to a pandas DataFrame:

>>> df = ds.to_df()
>>> df
               DOI
0  x1.y1.z1/123123
1  x2.y2.z2/456456
2  x3.y3.z3/789789

To save data to disk:

>>> ds.save_txt("dois.txt")
>>> ds.save_csv("dois.csv")
>>> ds.save_excel("dois.xlsx")

extend_dataset(ds: paperfetcher.datastructures.DOIDataset): Appends all items from DOIDataset ds to the end of the current dataset.

save_csv(file): Saves dataset to .csv file.

save_excel(file): Saves dataset to Excel file.

save_txt(file): Saves dataset to .txt file.

to_df(): Converts dataset to DataFrame.

to_txt_string(): Returns a string which can be written to .txt file

class paperfetcher.datastructures.Dataset(items: list = [])

Bases: object

Abstract interface that defines functions for child Dataset classes to implement.

Datasets are designed to store [usually tabular] data (as input to or output from paperfetcher searches), export data to pandas DataFrames, and load/save data to disk using common data formats (txt, csv, xlsx).

Parameters: items (iterable) – Items to store in dataset (default=[]).

append(item): Adds an item to the dataset.

extend(items: list): Adds each item from a list of items to the dataset.

classmethod from_csv(file): Loads dataset from .csv file.

classmethod from_excel(file): Loads dataset from Excel file.

classmethod from_txt(file): Loads dataset from .txt file.

save_csv(file): Saves dataset to .csv file.

save_excel(file): Saves dataset to Excel file.

save_txt(file): Saves dataset to .txt file.

to_df(): Converts dataset to DataFrame.

class paperfetcher.datastructures.HeadlessRISWriter(*, mapping: Optional[Dict] = None, list_tags: Optional[List[str]] = None, ignore: Optional[List[str]] = None, skip_unknown_tags: bool = False, enforce_list_tags: bool = True)

Bases: rispy.writer.BaseWriter

DEFAULT_LIST_TAGS: List[str] = ['A1', 'A2', 'A3', 'A4', 'AU', 'KW', 'N1']

DEFAULT_MAPPING: Dict = {'A1': 'first_authors', 'A2': 'secondary_authors', 'A3': 'tertiary_authors', 'A4': 'subsidiary_authors', 'AB': 'abstract', 'AD': 'author_address', 'AN': 'accession_number', 'AU': 'authors', 'C1': 'custom1', 'C2': 'custom2', 'C3': 'custom3', 'C4': 'custom4', 'C5': 'custom5', 'C6': 'custom6', 'C7': 'custom7', 'C8': 'custom8', 'CA': 'caption', 'CN': 'call_number', 'CY': 'place_published', 'DA': 'date', 'DB': 'name_of_database', 'DO': 'doi', 'DP': 'database_provider', 'EP': 'end_page', 'ER': 'end_of_reference', 'ET': 'edition', 'ID': 'id', 'IS': 'number', 'J2': 'alternate_title1', 'JA': 'alternate_title2', 'JF': 'alternate_title3', 'JO': 'journal_name', 'KW': 'keywords', 'L1': 'file_attachments1', 'L2': 'file_attachments2', 'L4': 'figure', 'LA': 'language', 'LB': 'label', 'M1': 'note', 'M3': 'type_of_work', 'N1': 'notes', 'N2': 'notes_abstract', 'NV': 'number_of_volumes', 'OP': 'original_publication', 'PB': 'publisher', 'PY': 'year', 'RI': 'reviewed_item', 'RN': 'research_notes', 'RP': 'reprint_edition', 'SE': 'section', 'SN': 'issn', 'SP': 'start_page', 'ST': 'short_title', 'T1': 'primary_title', 'T2': 'secondary_title', 'T3': 'tertiary_title', 'TA': 'translated_author', 'TI': 'title', 'TT': 'translated_title', 'TY': 'type_of_reference', 'UK': 'unknown_tag', 'UR': 'url', 'VL': 'volume', 'Y1': 'publication_year', 'Y2': 'access_date'}

PATTERN: str = '{tag} - {value}'

START_TAG: str = 'TY'
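The class attributes above fit together roughly as follows: the mapping translates RIS tags to dictionary keys, list tags may repeat, and the pattern formats each line. The sketch below uses a small subset of DEFAULT_MAPPING and the pattern string exactly as printed above. render_entry is a hypothetical function for illustration and does not reproduce rispy's writer.

```python
# A few entries copied from DEFAULT_MAPPING above (RIS tag -> dictionary key).
MAPPING_SUBSET = {
    "TY": "type_of_reference",
    "TI": "title",
    "DO": "doi",
    "AU": "authors",
}
LIST_TAGS = ["AU"]            # subset of DEFAULT_LIST_TAGS: values are lists
PATTERN = "{tag} - {value}"   # pattern string as printed above


def render_entry(entry):
    """Render one entry dictionary as a list of 'TAG - value' lines.

    Hypothetical illustration; not rispy's or paperfetcher's writer.
    """
    lines = []
    for tag, key in MAPPING_SUBSET.items():
        if key not in entry:
            continue
        # List tags emit one line per value; other tags emit a single line.
        values = entry[key] if tag in LIST_TAGS else [entry[key]]
        lines.extend(PATTERN.format(tag=tag, value=value) for value in values)
    return lines


entry = {
    "type_of_reference": "JOUR",
    "title": "A study of A",
    "doi": "10.xxyy/0.0.0.000001",
    "authors": ["P", "Q"],
}
ris_lines = render_entry(entry)
print("\n".join(ris_lines))
```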

class paperfetcher.datastructures.RISDataset(items: list = [])

Bases: paperfetcher.datastructures.Dataset

Stores a dataset of RIS items. RIS items are rispy-readable dictionaries.

An RISDataset can be created from an RIS-formatted string, or from an RIS file. An RISDataset can be written to an RIS-formatted string, or to an RIS file. Individual item dictionaries in an RISDataset can be modified to add new tags or change the values of existing tags.
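Since each RIS item is a plain dictionary, editing tags is ordinary dictionary manipulation. The sketch below works on a bare list of dictionaries rather than an RISDataset, so it makes no assumption about how the dataset exposes its items. The keys come from the tag mapping listed for HeadlessRISWriter, and the note text is invented for the example.

```python
# Stand-in for the item dictionaries held by an RISDataset.
items = [
    {"type_of_reference": "JOUR", "doi": "10.xxyy/0.0.0.000001", "title": " A study of A "},
    {"type_of_reference": "JOUR", "doi": "10.xxyy/0.0.0.000002", "title": "An investigation into B"},
]

for item in items:
    # Add a new tag: 'notes' corresponds to the RIS list tag N1, so use a list.
    item.setdefault("notes", []).append("found by hand search")
    # Change an existing tag: strip stray whitespace from the title.
    item["title"] = item["title"].strip()

print(items[0])
```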

Parameters: items (list) – List of citations to store (default=[]). Each citation should be a rispy-readable dictionary.

Examples

To create an RISDataset from a list of rispy-readable dictionaries (see rispy doc on GitHub for details):

>>> dict_list = [{'journal_name': ...}, {'journal_name': ...}, ...]
>>> ds = RISDataset(dict_list)

To load an RISDataset from an RIS-formatted string:

>>> ds = RISDataset.from_ris_string(ris_string)

To load an RISDataset from an RIS file:

>>> ds = RISDataset.from_ris(ris_file)

extend_dataset(ds: paperfetcher.datastructures.RISDataset): Appends all items from RISDataset ds to the end of the current dataset.

classmethod from_ris(file): Loads dataset from RIS file.

classmethod from_ris_string(ris_string): Loads dataset from RIS-formatted string.

save_ris(filename, headers=False)

Saves dataset to .ris file.

Parameters

filename – Path to file to write RIS data to.
headers (bool, default=False) – If set to true, writes reference number before each RIS entry.

to_ris_string(headers=False)

Returns a string which can be written to .ris file.

Parameters: headers (bool, default=False) – If set to true, writes reference number before each RIS entry.

paperfetcher.exceptions module

Definitions of all Exceptions raised by paperfetcher.

exception paperfetcher.exceptions.ContentNegotiationError

Bases: Exception

Exception raised when content negotiation fails.

exception paperfetcher.exceptions.DatasetError

Bases: Exception

Exception raised when an incorrect operation is performed on a dataset.

exception paperfetcher.exceptions.QueryError

Bases: Exception

Exception raised when query fails.

exception paperfetcher.exceptions.RISParsingError

Bases: Exception

Exception raised when RIS parsing fails.

exception paperfetcher.exceptions.SearchError

Bases: Exception

Exception raised when search fails.

paperfetcher.handsearch module

Classes to fetch all journal works (articles) matching a set of keywords and within a given date range by querying various APIs.

class paperfetcher.handsearch.CrossrefSearch(ISSN='', type='journal-article', keyword_list=None, from_date=None, until_date=None, batch_size=20, sort_order='desc')

Bases: object

Retrieves all works from a journal, given its ISSN, which match a set of keywords and are within a date range.

Calling a CrossrefSearch object performs the search. The call accepts three arguments: display_progress_bar (True/False; default=True) toggles the search progress bar, while select (True/False; default=False) and select_fields (list) restrict the query to a subset of metadata for each journal article.

If select is False, a full (memory and time intensive) search is performed, fetching all metadata associated with each journal work.

If select is True, a subset of fields to fetch can be specified using the select_fields parameter. Check the Crossref REST API doc for details on which field names are permissible.

Note

Performing a search with no keywords and select=False can be very time- and memory-intensive. The search object will complain when such a search is performed.

Parameters

ISSN (str) – Journal (web) ISSN.
type (str) – Type of works to fetch (default="journal-article").
keyword_list (list) – List of keywords (str) to query with (default=None).
from_date (str) – Fetch articles published on or after this date (format="YYYY-MM-DD", default=None).
until_date (str) – Fetch articles published until this date (format="YYYY-MM-DD", default=None).
batch_size (int) – Number of works to fetch in each batch (default=20).
sort_order (str) – Order in which to sort works by date ("asc" or "desc", default="desc").
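To show how these parameters could translate into a Crossref REST API request, the sketch below builds a parameter dictionary for a /journals/{ISSN}/works query. The parameter names (filter, rows, sort, order, query, select) are Crossref's. The mapping itself, including joining keywords with spaces and sorting by "published", is an assumption of this sketch and not necessarily how CrossrefSearch builds its query. crossref_works_params is a hypothetical helper.

```python
def crossref_works_params(work_type="journal-article", keyword_list=None,
                          from_date=None, until_date=None,
                          batch_size=20, sort_order="desc",
                          select_fields=None):
    """Map search settings onto Crossref query parameters (assumed mapping)."""
    filters = [f"type:{work_type}"]
    if from_date:
        filters.append(f"from-pub-date:{from_date}")
    if until_date:
        filters.append(f"until-pub-date:{until_date}")

    params = {
        "filter": ",".join(filters),  # Crossref filters are comma-separated
        "rows": batch_size,           # number of works per batch
        "sort": "published",          # assumption: sort by publication date
        "order": sort_order,
    }
    if keyword_list:
        params["query"] = " ".join(keyword_list)  # assumption: space-joined
    if select_fields:
        params["select"] = ",".join(select_fields)
    return params


params = crossref_works_params(keyword_list=["hydration"],
                               from_date="2018-01-01",
                               until_date="2020-01-01")
print(params)
```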

ISSN

Journal (web) ISSN.

Type: str

type

Type of works to fetch (default="journal-article").

Type: str

keyword_list

List of keywords (str) to query with.

Type: list

from_date

Fetch articles published on or after this date (format="YYYY-MM-DD").

Type: str

until_date

Fetch articles published until this date (format="YYYY-MM-DD").

Type: str

batch_size

Number of works to fetch in each batch (default=20).

Type: int

sort_order

Order in which to sort works by date ("asc" or "desc", default="desc").

Type: str

results

List of dictionaries, each dictionary corresponds to a work.

Type: list
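The batch_size attribute implies that results arrive in several requests. As a sketch of the arithmetic only, the hypothetical helper below lists the (offset, rows) pair for each batch needed to cover a known number of works. Whether paperfetcher pages with offsets or with another mechanism is not stated here, so treat the offset scheme as an assumption.

```python
def batch_offsets(total_works, batch_size=20):
    """Return the (offset, rows) pair for each batch covering total_works.

    Hypothetical helper; offset-based paging is an assumption of this sketch.
    """
    return [
        (offset, min(batch_size, total_works - offset))
        for offset in range(0, total_works, batch_size)
    ]


batches = batch_offsets(13, batch_size=5)
print(batches)
```

With the default batch size of 20, a search that matches 13 works, as in the example below, would need a single batch.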

Examples

>>> search = CrossrefSearch(ISSN="1520-5126", keyword_list=["hydration"], from_date="2018-01-01", until_date="2020-01-01")
>>> search()
>>> len(search)
13
>>> ds = search.get_DOIDataset()
>>> ds.to_df()
                     DOI
0   10.1021/jacs.9b09103
1   10.1021/jacs.9b06862
2   10.1021/jacs.9b09111
3   10.1021/jacs.9b05874
4   10.1021/jacs.9b02820
5   10.1021/jacs.9b05136
6   10.1021/jacs.9b02742
7   10.1021/jacs.9b00577
8   10.1021/jacs.8b11448
9   10.1021/jacs.8b12877
10  10.1021/jacs.8b11667
11  10.1021/jacs.8b08298
12  10.1021/jacs.7b11537

dry_run(select=False, select_fields=[]): Reports how many works this search will fetch.

get_CitationsDataset(field_list=[], field_parsers_list=[])

Parses a selection of fields from search results and returns them as a CitationsDataset object.

Parameters

field_list (list) – Names of fields to parse (see Crossref REST API doc for permissible field name values).
field_parsers_list (list) – List of field parser functions corresponding to each field name. A None value means that no parser is needed for that field.

Returns

CitationsDataset

Example

>>> search = handsearch.CrossrefSearch(ISSN="1520-5126", keyword_list=["hydration"], from_date="2018-01-01",
...                                    until_date="2020-01-01")
>>> search(select=True, select_fields=['DOI', 'URL', 'title', 'author', 'issued'])
>>> ds = search.get_CitationsDataset(field_list=['DOI', 'URL', 'title', 'author', 'issued'],
...                                  field_parsers_list=[None, None, parsers.crossref_title_parser,
...                                                      parsers.crossref_authors_parser, parsers.crossref_date_parser])
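The pairing of field_list with field_parsers_list can be sketched as below: each field is looked up in a work's metadata and passed through its parser, with None meaning the raw value is kept. parse_fields and first_title are hypothetical names for this illustration, and the work dictionary is made up; the library's own parsing loop may differ.

```python
def parse_fields(work, field_list, field_parsers_list):
    """Build one dataset row from a work, applying a parser where given.

    Hypothetical helper for illustration; not paperfetcher's code.
    """
    row = []
    for field, parser in zip(field_list, field_parsers_list):
        value = work.get(field)
        # A None parser means the value is stored as-is.
        row.append(value if parser is None else parser(value))
    return row


def first_title(titles):
    """Hypothetical parser: take the first entry of a list of titles."""
    return titles[0]


work = {"DOI": "10.xxyy/0.0.0.000001", "title": ["A study of A"]}
row = parse_fields(work, ["DOI", "title"], [None, first_title])
print(row)
```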

get_DOIDataset()

Extracts DOIs from search results and returns them as a DOIDataset object.

Returns: DOIDataset

get_RISDataset(extra_field_list=[], extra_field_parser_list=[], extra_field_rispy_tags=[])

Extracts DOIs from search results and fetches RIS data for each DOI using Crossref’s content negotiation service.

Extra fields in the search results that are not automatically populated by Crossref’s content negotiation service can be mapped to the RIS format (through rispy’s mapping) using the extra_field_list, extra_field_parser_list, and extra_field_rispy_tags arguments.

Parameters

extra_field_list (list) – List of extra fields to parse and include in RIS file (see Crossref REST API doc for permissible field name values).
extra_field_parser_list (list) – List of field parser functions corresponding to each extra field name. A None value means that no parser is needed for that field.
extra_field_rispy_tags (list) – List of rispy tags for each extra field.

paperfetcher.parsers module

Functions to parse data returned from queries.

New parsers can be added here.

paperfetcher.parsers.crossref_authors_parser(author_array)

Function to parse authors.

Returns: str

paperfetcher.parsers.crossref_date_parser(date)

Function to parse date.

Returns: str

paperfetcher.parsers.crossref_title_parser(title)

Function to parse title.

Returns: str
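For orientation, the sketch below shows what parsers of this kind might look like. It relies on the shape of Crossref work metadata (title is a list of strings, author is a list of dictionaries with given and family keys, and issued holds a date-parts list). The three functions are hypothetical stand-ins, and their output formats are assumptions, not the behaviour of the paperfetcher parsers documented above.

```python
def title_to_str(title):
    """Join the list of title strings that Crossref returns."""
    return " ".join(title)


def authors_to_str(author_array):
    """Format authors as 'A, B and C' from dicts with given/family names."""
    names = [
        " ".join(part for part in (a.get("given"), a.get("family")) if part)
        for a in author_array
    ]
    if len(names) <= 1:
        return "".join(names)
    return ", ".join(names[:-1]) + " and " + names[-1]


def date_to_str(date):
    """Turn {'date-parts': [[2020, 2, 20]]} into '2020-02-20'."""
    parts = date["date-parts"][0]
    # Month and day are zero-padded; a year-only date stays as 'YYYY'.
    return "-".join(str(part).zfill(2) for part in parts)


print(title_to_str(["A study of A"]))
print(authors_to_str([{"given": "P", "family": "X"},
                      {"given": "Q", "family": "Y"},
                      {"family": "Z"}]))
print(date_to_str({"date-parts": [[2020, 2, 20]]}))
```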

paperfetcher.snowballsearch module

Classes to fetch all journal articles in the references of (i.e. backward search) or citing (i.e. forward search) a set of journal articles.

For backward search, you can use either Crossref or COCI (should be equivalent). For forward search, you can only use COCI at the moment.
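The difference between the two directions can be seen on a toy citation graph. The sketch below uses an in-memory dictionary with invented DOIs in place of any API; backward_search and forward_search are hypothetical functions that mimic only the set-valued result of the search classes in this module.

```python
# Toy citation graph: each DOI maps to the DOIs in its reference list.
REFERENCES = {
    "10.x/a": ["10.x/c", "10.x/d"],
    "10.x/b": ["10.x/d", "10.x/e"],
    "10.x/f": ["10.x/a"],
}


def backward_search(search_dois):
    """Collect every DOI referenced by any of the search DOIs."""
    result = set()
    for doi in search_dois:
        result.update(REFERENCES.get(doi, []))
    return result


def forward_search(search_dois):
    """Collect every DOI whose reference list contains a search DOI."""
    targets = set(search_dois)
    return {
        citing for citing, refs in REFERENCES.items()
        if targets.intersection(refs)
    }


print(backward_search(["10.x/a", "10.x/b"]))
print(forward_search(["10.x/a", "10.x/d"]))
```

Collecting results in a set means a DOI referenced by several search articles (10.x/d here) appears only once.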

class paperfetcher.snowballsearch.COCIBackwardReferenceSearch(search_dois: list)

Bases: object

Retrieves (the DOIs of) all articles in the references of a list of (DOIs of) articles by using the COCI REST API.

Parameters: search_dois (list) – List of DOIs (str) to fetch references of.

search_dois

List of DOIs (str) to fetch references of.

Type: list

result_dois

Set of DOIs which are referenced by the DOIs in search_dois.

Type: set

Example

>>> search = snowballsearch.COCIBackwardReferenceSearch(["10.1021/acs.jpcb.1c02191", "10.1073/pnas.2018234118"])
>>> search()
>>> len(search)
140
>>> search.result_dois
{'10.1021/jp972543+', '10.1073/pnas.0708088105',  ... ,  '10.1073/pnas.0705830104'}

classmethod from_DOIDataset(search_dataset: paperfetcher.datastructures.DOIDataset)

Constructs a search object from a DOIDataset.

Parameters: search_dataset (DOIDataset) – Dataset of DOIs to fetch references of.
Returns: COCIBackwardReferenceSearch

get_DOIDataset()

Returns search results as a DOIDataset object.

Returns: DOIDataset

get_RISDataset()

Returns search results as an RISDataset object. Uses the Crossref REST API for content negotiation.

Returns: RISDataset

class paperfetcher.snowballsearch.COCIForwardCitationSearch(search_dois: list)

Bases: object

Retrieves (the DOIs of) all articles citing a list of (DOIs of) articles by using the COCI REST API.

Parameters: search_dois (list) – List of DOIs (str) to fetch citations of.

search_dois

List of DOIs (str) to fetch citations of.

Type: list

result_dois

Set of DOIs which cite the DOIs in search_dois.

Type: set

Example

>>> search = snowballsearch.COCIForwardCitationSearch(["10.1021/acs.jpcb.8b11423", "10.1073/pnas.2018234118"])
>>> search()
>>> len(search)
11
>>> search.result_dois
{'10.1039/c9sc02097g', '10.1021/acs.jpcb.1c05748', ... , '10.1021/acs.jpclett.9b02052'}

classmethod from_DOIDataset(search_dataset: paperfetcher.datastructures.DOIDataset)

Constructs a search object from a DOIDataset.

Parameters: search_dataset (DOIDataset) – Dataset of DOIs to fetch citations of.
Returns: COCIForwardCitationSearch

get_DOIDataset()

Returns search results as a DOIDataset object.

Returns: DOIDataset

get_RISDataset()

Returns search results as an RISDataset object. Uses the Crossref REST API for content negotiation.

Returns: RISDataset

class paperfetcher.snowballsearch.CrossrefBackwardReferenceSearch(search_dois: list)

Bases: object

Retrieves (the DOIs of) all articles in the references of a list of (DOIs of) articles by using the Crossref REST API.

Parameters: search_dois (list) – List of DOIs (str) to fetch references of.

search_dois

List of DOIs (str) to fetch references of.

Type: list

result_dois

Set of DOIs which are referenced by the DOIs in search_dois.

Type: set

Example

>>> search = snowballsearch.CrossrefBackwardReferenceSearch(["10.1021/acs.jpcb.1c02191", "10.1073/pnas.2018234118"])
>>> search()
>>> len(search)
140
>>> search.result_dois
{'10.1021/jp972543+', '10.1073/pnas.0708088105',  ... ,  '10.1073/pnas.0705830104'}

classmethod from_DOIDataset(search_dataset: paperfetcher.datastructures.DOIDataset)

Constructs a search object from a DOIDataset.

Parameters: search_dataset (DOIDataset) – Dataset of DOIs to fetch references of.
Returns: CrossrefBackwardReferenceSearch

get_DOIDataset()

Returns search results as a DOIDataset object.

Returns: DOIDataset

get_RISDataset()

Returns search results as an RISDataset object. Uses the Crossref REST API for content negotiation.

Returns: RISDataset

Module contents

class paperfetcher.GlobalConfig

Bases: object

crossref_plus = False

crossref_plus_auth_token = ''

crossref_useragent = 'paperfetcher/0.0.1 (https://github.com/paperfetcher/paperfetcher; mailto:pallathakash@gmail.com)'

loglevel = 20

streamlit = False
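As a sketch of how settings like these might feed into requests, the example below derives HTTP headers from a stand-in configuration class. ConfigSketch and crossref_headers are hypothetical names, the user-agent string is shortened, and the use of a Crossref-Plus-API-Token bearer header is an assumption about how the token would be sent rather than a description of paperfetcher's code. The loglevel value of 20 does correspond to logging.INFO in the standard library.

```python
import logging


class ConfigSketch:
    """Stand-in mirroring the GlobalConfig attributes listed above."""
    crossref_plus = False
    crossref_plus_auth_token = ""
    crossref_useragent = "paperfetcher/0.0.1 (https://github.com/paperfetcher/paperfetcher)"
    loglevel = 20


def crossref_headers(config):
    """Build request headers from the configuration (assumed scheme)."""
    headers = {"User-Agent": config.crossref_useragent}
    if config.crossref_plus:
        # Assumption: the Plus token is sent as a bearer token header.
        headers["Crossref-Plus-API-Token"] = "Bearer " + config.crossref_plus_auth_token
    return headers


headers = crossref_headers(ConfigSketch)
level_name = logging.getLevelName(ConfigSketch.loglevel)
print(headers)
print(level_name)
```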