Is there a way to scan a pdf to ensure it doesn't contain anything that could be a virus?

The answers to "Can a PDF file contain a virus?" show that it clearly can! Sometimes we can be quite sure a certain PDF should not need to do anything sophisticated - for example, a book in PDF form - so we wouldn't expect it to contain embedded executables or similarly complex items such as JavaScript. If it did, it could be avoided or treated with extra caution.

Question

Is there a simple way on macOS and Windows to ensure that any URL ending in .pdf is scanned for anything more complicated than text and images (the things we'd expect to find in, say, a book), and is only opened/downloaded/viewed if it passes the check? Note: I know many harmless PDFs contain some complex behaviours, but I'd prefer to turn the check off for those specific cases (i.e. if they're from a trusted source), rather than allowing potentially malicious behaviour.

asked Mar 13, 2022 at 23:40

Is there a specific reason to believe that standard virus scanners would not suffice? Your question does not address this IMHO obvious solution.

Commented Mar 14, 2022 at 7:02

3 Answers

Ensure? No. A simple reason: Images, layout information, fonts, and all sorts of other "simple" data can nonetheless be malicious, and can lead to arbitrary code execution if the parser for them has an exploitable bug (a.k.a. a vulnerability). This is not academic; lots of exploits, including some quite famous ones, were carried out through image or font parsers.

Similarly, any scanner that you could use to theoretically validate the contents of a PDF could, itself, be vulnerable. After all, it too is parsing the file, and there's nothing that says security tools can't contain vulnerabilities themselves. In fact, adding a security tool always increases the attack surface - the amount of space where a vulnerability could exist - and there is no way to guarantee that the tool, even if not itself vulnerable, will reliably detect malicious data without passing it on to other code.

You could, in theory, have a PDF reader that doesn't handle any but the most common and trusted formats; it wouldn't be able to open everything (not even every book), but it could open most of them (probably all from most publishers, etc.). It wouldn't be totally safe - even common and trusted code can have vulnerabilities that lurk undetected for over a decade. I don't know of any PDF reader that has this feature (and specific product recommendations are out of scope for this site anyhow), but you might be able to find one if you look.

Another option would be a PDF validator. As mentioned above, this does add attack surface (the validator itself). In theory, though, a validator could apply strict validation without attempting to render the font/image/layout/whatever, which reduces the risk somewhat. It would probably (not guaranteed, but probably) throw out anything that isn't safe without being at risk itself, unless the validator was software somebody specifically targeted or was rather shoddily written.
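To make the validator idea concrete, here is a minimal sketch in Python of a check that never renders anything: it only searches the raw bytes for PDF name tokens that a text-and-images document would not normally need. The function names and the list of tokens are my own assumptions for illustration, not a standard. It will also miss anything hidden inside compressed object streams or encrypted files, and it can produce false positives when a token happens to appear in binary data, so treat it as a coarse first filter rather than proof of safety.

```python
import re

# Name tokens that a text-and-images-only PDF would not normally need.
# This list is an illustrative assumption, not an exhaustive standard.
RISKY_NAMES = (
    b"/JavaScript", b"/JS", b"/OpenAction", b"/AA", b"/Launch",
    b"/EmbeddedFile", b"/RichMedia", b"/XFA",
)

# PDF names may hide characters as #xx hex escapes, e.g. /J#61vaScript.
_HEX_ESCAPE = re.compile(rb"#([0-9A-Fa-f]{2})")


def _decode_name_escapes(data: bytes) -> bytes:
    """Undo #xx escapes so that /J#61vaScript is seen as /JavaScript."""
    return _HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), data)


def find_risky_names(data: bytes) -> dict[str, int]:
    """Count occurrences of each risky name token in the raw file bytes."""
    decoded = _decode_name_escapes(data)
    counts: dict[str, int] = {}
    for name in RISKY_NAMES:
        # Require a non-alphanumeric character (or end of data) after the
        # token, so that /JS does not match a longer name such as /JSFoo.
        pattern = re.compile(re.escape(name) + rb"(?![A-Za-z0-9])")
        hits = len(pattern.findall(decoded))
        if hits:
            counts[name.decode("ascii")] = hits
    return counts


def looks_simple(data: bytes) -> bool:
    """True if the data has a PDF header and none of the risky tokens."""
    return data.startswith(b"%PDF-") and not find_risky_names(data)
```

You would read the downloaded file in binary mode (for example, `open(path, "rb").read()`) and refuse to open it when `looks_simple` returns `False`, unless it comes from a source you have decided to trust. Because the check is plain byte matching, it exposes very little parsing code of its own, which is the property the validator approach relies on.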

One way to mitigate all these risks is to handle the PDFs in a sandbox: a low-privilege process with minimal and strictly controlled access to the rest of the system. Sandboxing is quite common, including for PDFs. Adobe Reader was one of the first really popular desktop programs that I know of to include a sandbox (other than browsers; Adobe adapted the one Chrome was already using), and sandboxing is used for approximately all apps on mobile devices and most apps from the desktop Windows Store and the Mac App Store. Mind you, sandboxes aren't a perfect solution: they don't restrict everything, and even things they do try to restrict might still be possible if the sandbox is itself buggy in the right way (as pretty much all complex software is). Still, a sandbox adds defense in depth.