`svgdigitizer.pdf`

Module for handling operations related to PDF files.

class svgdigitizer.pdf.Pdf(pdf_filepath, doi=None)

Handles all interactions with the PDF file.

property bibliographic_entry

Get the citation for the DOI of the provided PDF file. Returns a BibTeX string.

EXAMPLES:

>>> import os
>>> from svgdigitizer.pdf import Pdf
>>> from svgdigitizer.test.cli import TemporaryData
>>> with TemporaryData("**/Hermann_2018_J._Electrochem._Soc._165_J3192.pdf") as directory:
...     # Do not assign the instance to a variable: that keeps the file open and fails on Windows.
...     Pdf(os.path.join(directory, "Hermann_2018_J._Electrochem._Soc._165_J3192.pdf")).bibliographic_entry
'@article{Hermann_2018, title={Enhanced Electrocatalytic Oxidation ... year={2018}, pages={J3192–J3198} }'

static build_identifier(citation, skip_words=('a', 'ab', 'aboard', 'about', 'above', 'across', 'after', 'against', 'al', 'along', 'amid', 'among', 'an', 'and', 'anti', 'around', 'as', 'at', 'before', 'behind', 'below', 'beneath', 'beside', 'besides', 'between', 'beyond', 'but', 'by', 'd', 'da', 'das', 'de', 'del', 'dell', 'dello', 'dei', 'degli', 'della', 'dell', 'delle', 'dem', 'den', 'der', 'des', 'despite', 'die', 'do', 'down', 'du', 'during', 'ein', 'eine', 'einem', 'einen', 'einer', 'eines', 'el', 'en', 'et', 'except', 'for', 'from', 'gli', 'i', 'il', 'in', 'inside', 'into', 'is', 'l', 'la', 'las', 'le', 'les', 'like', 'lo', 'los', 'near', 'nor', 'of', 'off', 'on', 'onto', 'or', 'over', 'past', 'per', 'plus', 'round', 'save', 'since', 'so', 'some', 'sur', 'than', 'the', 'through', 'to', 'toward', 'towards', 'un', 'una', 'unas', 'under', 'underneath', 'une', 'unlike', 'uno', 'unos', 'until', 'up', 'upon', 'versus', 'via', 'von', 'while', 'with', 'within', 'without', 'yet', 'zu', 'zum'))

Build the entry identifier from a BibTeX citation provided as BibliographyData (pybtex). By default, some common title words are omitted from the identifier (see skip_words). Pass a custom word list to change which words are omitted.

Examples:

A complex title:

>>> from svgdigitizer.pdf import Pdf
>>> from pybtex.database import parse_string
>>> bibtex_string = (
...     '@article{Mar_Ol__2021, title={Surfaces are made by the devil: fooo,2 of XX(110)/other stuff.},'
...     ' volume={145}, ISSN={0015-0057}, url={http://dx.doi.org/10.1016/j.what.2015.123456},'
...     ' DOI={10.1016/j.what.2015.123456}, journal={My Journal}, publisher={Publisher},'
...     ' author={Marí-Olé, J. and Foo, B.}, year={2013}, month=dec, pages={4567} }'
... )
>>> bibliography_data = parse_string(bibtex_string, bib_format="bibtex")
>>> Pdf.build_identifier(bibliography_data)
'mari-ole_2013_surfaces_4567'

Special characters are skipped when selecting the first word:

>>> bibtex_string = (
...     '@article{hermannEffectPHAnion2021,'
...     ' title = {The {{Effect}} of {{pH}} and {{Anion Adsorption}} on {{Formic Acid Oxidation}} on {{Au}}(111) {{Electrodes}}},'
...     ' author = {Hermann, Johannes M. and Abdelrahman, Areeg and Jacob, Timo and Kibler, Ludwig A.},'
...     ' year = 2021, month = jul, journal = {Electrochim. Acta},'
...     ' volume = {385}, pages = {138279}, issn = {0013-4686}, doi = {10.1016/j.electacta.2021.138279} }'
... )
>>> bibliography_data = parse_string(bibtex_string, bib_format="bibtex")
>>> Pdf.build_identifier(bibliography_data)
'hermann_2021_effect_138279'

Keep only the first meaningful word of the title:

>>> bibtex_string = (
...     '@article{Hermann_2018, title={An in the foo bar article}, volume={165},'
...     ' ISSN={0013-4651}, url={http://dx.doi.org/10.1149/2.0221810jes},'
...     ' DOI={10.1149/2.0221810jes}, journal={Journal of The Electrochemical Society},'
...     ' publisher={The Electrochemical Society},'
...     ' author={Hermann, Johannes M. and Jacob, Timo and Kibler, Ludwig A.},'
...     ' year={2018}, month=aug, pages={J3192–J3198} }'
... )
>>> bibliography_data = parse_string(bibtex_string, bib_format="bibtex")
>>> Pdf.build_identifier(bibliography_data)
'hermann_2018_foo_j3192'

Dashes in the first word are kept:

>>> bibtex_string = (
...     '@article{Hermann_2018, title={An-in the foo bar article}, volume={165},'
...     ' ISSN={0013-4651}, url={http://dx.doi.org/10.1149/2.0221810jes},'
...     ' DOI={10.1149/2.0221810jes}, journal={Journal of The Electrochemical Society},'
...     ' publisher={The Electrochemical Society},'
...     ' author={Hermann, Johannes M. and Jacob, Timo and Kibler, Ludwig A.},'
...     ' year={2018}, month=aug, pages={J3192–J3198} }'
... )
>>> bibliography_data = parse_string(bibtex_string, bib_format="bibtex")
>>> Pdf.build_identifier(bibliography_data)
'hermann_2018_an-in_j3192'

Special LaTeX characters are translated to Unicode:

>>> bibtex_string = (
... r"@article{ fooo-bair_2012_random_110, "
... r"author = {\'Alvaro-Monta{\~n}a, Baz and Author, Two}, "
... r"title = {Ramdon title}, journal = {Journal}, volume = {3}, pages = {110--123}, "
... r"year = {2012}, publisher = {Publisher} }"
... )
>>> bibliography_data = parse_string(bibtex_string, bib_format="bibtex")
>>> Pdf.build_identifier(bibliography_data)
'alvaro-montana_2012_ramdon_110'

Some publications have no pages in their BibTeX entry; the page is then left out of the identifier:

>>> bibtex_string = (
...     "@article{White_2026, title={Emergent quantization from a dynamic vacuum},"
...     " volume={8}, ISSN={2643-1564}, url={http://dx.doi.org/10.1103/l8y7-r3rm},"
...     " DOI={10.1103/l8y7-r3rm}, number={1}, journal={Physical Review Research},"
...     " publisher={American Physical Society (APS)},"
...     " author={White, Harold and Vera, Jerry and Sylvester, Andre and Dudzinski, Leonard},"
...     " year={2026}, month=Mar }"
... )
>>> bibliography_data = parse_string(bibtex_string, bib_format="bibtex")
>>> Pdf.build_identifier(bibliography_data)
'white_2026_emergent'
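
The naming pattern visible in the examples above (first author, year, first meaningful title word, first page) can be sketched without the library. This is a minimal illustration under stated assumptions, not the implementation used by `build_identifier`: the function `sketch_identifier`, the helper `ascii_lower`, and the shortened `SKIP_WORDS` tuple are hypothetical names, and the function takes plain strings rather than pybtex BibliographyData.

```python
import re
import unicodedata

# Hypothetical, shortened skip list; the library default is much longer.
SKIP_WORDS = ("a", "an", "and", "in", "of", "on", "the")


def ascii_lower(text):
    """Lowercase text and strip accents, e.g. 'Marí-Olé' -> 'mari-ole'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


def sketch_identifier(last_name, year, title, pages=None, skip_words=SKIP_WORDS):
    """Join author, year, first meaningful title word and first page with underscores."""
    # Split the title on whitespace; keep only word characters and dashes in each word.
    words = [re.sub(r"[^\w-]", "", ascii_lower(word)) for word in title.split()]
    # Assumes the title contains at least one word outside the skip list.
    first_word = next(word for word in words if word and word not in skip_words)
    parts = [ascii_lower(last_name), str(year), first_word]
    if pages:
        # Use only the first page of a range such as 'J3192–J3198' or '110--123'.
        parts.append(re.split(r"[-–]", ascii_lower(pages))[0])
    return "_".join(parts)


identifier = sketch_identifier("Hermann", 2018, "An in the foo bar article", "J3192–J3198")
```

Omitting `pages` drops the trailing page component, which mirrors the last example above.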

property doc: Holds the opened PDF file.

property doi: Extract the DOI from the provided PDF or the provided string. Since in some cases additional pages are prepended to the PDF, the DOI is extracted from either the first or second page.

export_png(page_idx, dpi): Returns the requested PDF page as PNG with specified DPI.

property num_pages: Number of pages of the PDF file.

rename_by_key(): Rename the provided PDF file by the key derived from citation.
