Analyzing Malicious PDF with peepdf

Perform static analysis of malicious PDF documents using peepdf, pdfid, and pdf-parser to extract embedded JavaScript, shellcode, and other suspicious objects.

What Is This

"Analyzing Malicious PDF with peepdf" is a cybersecurity skill focused on using peepdf, together with the companion tools pdfid and pdf-parser, to perform static analysis of potentially malicious PDF documents. It lets analysts, incident responders, and forensic investigators identify and extract embedded JavaScript, shellcode, and other suspicious objects within PDF files. Because the tools parse the file rather than render it, a weaponized document can be dissected without opening it in a PDF reader and triggering the payload.

peepdf is a powerful Python-based tool designed for detailed PDF analysis. It provides an interactive shell, enabling analysts to traverse the internal object structure of PDFs, decode streams, and extract embedded content. Complementary tools like pdfid and pdf-parser, developed by Didier Stevens, allow for rapid triage and in-depth object parsing, forming a robust workflow for static PDF malware analysis.

Why Use It

Malicious PDF documents are commonly used in phishing campaigns and targeted attacks. Attackers often embed harmful JavaScript, exploits, or even executable payloads within PDFs to compromise end-user systems. Traditional antivirus solutions may fail to detect such threats, especially if obfuscation or novel exploitation techniques are used.

Static analysis using peepdf and related tools offers several advantages:

  • Safe Inspection: Examine the PDF’s structure and content without triggering any embedded exploits or payloads.
  • Detailed Object Analysis: Identify suspicious objects, streams, and encoded data that could harbor malicious code.
  • Extraction of Artifacts: Extract and analyze embedded JavaScript, shellcode, and files for further reverse engineering or sandbox execution.
  • Signature Development: Gather indicators of compromise (IOCs) and behavioral patterns to enhance detection rules for security tools.
  • Forensic Integrity: Maintain evidence integrity by working in a controlled, read-only manner.

This skill is essential for malware analysts, DFIR (Digital Forensics and Incident Response) professionals, and security engineers who need to dissect, understand, and document PDF-based threats.

How to Use It

Prerequisites

Before starting, ensure you have:

  • Python 3.8+ with peepdf-3 installed (pip install peepdf-3)
  • Didier Stevens’ pdfid.py and pdf-parser.py (download from his GitHub repository)
  • An isolated analysis environment (such as a virtual machine or sandbox)
  • Optionally, a V8 binding for JavaScript emulation (PyV8 for the original peepdf; on Python 3 its successor STPyV8 is the usual choice) and Pylibemu for shellcode analysis

Step 1: Triage with pdfid

Begin by scanning the suspicious PDF for known exploit indicators using pdfid:

python pdfid.py suspicious.pdf

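pdfid prints a count for each keyword it knows. Look for non-zero counts on suspicious keywords such as /JavaScript, /JS, /OpenAction, /Launch, and /EmbeddedFile. A non-zero count does not prove the file is malicious, but it tells you the PDF contains scripts, automatic actions, or embedded payloads that deserve a closer look in the following steps.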
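The keyword triage can be approximated in a few lines of Python, which is handy when pdfid is not available or when many samples need scoring in bulk. The sketch below is a simplified stand-in, not pdfid's actual implementation: the function names are hypothetical, the keyword list is the one from this step, and the hex-escape handling (for names such as /J#61vaScript) is applied to the whole file rather than to names only.

```python
import re

# Keywords from the triage step above.
SUSPICIOUS_KEYWORDS = ["/JavaScript", "/JS", "/OpenAction", "/Launch", "/EmbeddedFile"]

_HEX_ESCAPE = re.compile(rb"#([0-9A-Fa-f]{2})")


def normalize_names(data: bytes) -> bytes:
    """Undo #xx hex escapes so that /J#61vaScript reads as /JavaScript."""
    return _HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), data)


def count_keywords(data: bytes) -> dict:
    """Count each keyword, requiring that the PDF name ends right after the match."""
    normalized = normalize_names(data)
    counts = {}
    for keyword in SUSPICIOUS_KEYWORDS:
        # The lookahead stops /JS from matching inside a longer name.
        pattern = re.escape(keyword.encode("ascii")) + rb"(?![A-Za-z0-9])"
        counts[keyword] = len(re.findall(pattern, normalized))
    return counts
```

To use it, read the sample in binary mode and pass the bytes to count_keywords. Any keyword with a non-zero count is a reason to continue with object-level parsing.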

Step 2: Parse Objects with pdf-parser

Use pdf-parser to identify and extract objects of interest:

python pdf-parser.py suspicious.pdf

To filter for objects containing JavaScript:

python pdf-parser.py suspicious.pdf --search javascript

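To decode the stream of a specific object (for example, object 5) and write it to a file, pass -f to apply the stream filters and give -d an output filename:

python pdf-parser.py suspicious.pdf -o 5 -f -d object5.bin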
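To make the object search and object selection above less of a black box, here is a rough Python sketch of the same idea: walk the "N G obj ... endobj" blocks, look for a string in the non-stream part, and pull out the raw stream of one object. It is an illustration under simplifying assumptions (regex-based, no object streams, no cross-reference handling), not how pdf-parser is implemented, and all function names are hypothetical.

```python
import re

# "N G obj ... endobj" with a lazy body; malformed PDFs can break this assumption.
_OBJECT = re.compile(rb"(?<!\d)(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)
_STREAM = re.compile(rb"stream\r?\n(.*?)\r?\n?endstream", re.DOTALL)


def iter_objects(data: bytes):
    """Yield (object_number, generation, body_bytes) for each indirect object."""
    for match in _OBJECT.finditer(data):
        yield int(match.group(1)), int(match.group(2)), match.group(3)


def search_objects(data: bytes, needle: str) -> list:
    """Return numbers of objects whose non-stream part contains needle (case-insensitive)."""
    wanted = needle.lower().encode("ascii")
    hits = []
    for number, _generation, body in iter_objects(data):
        dictionary_part = body.split(b"stream", 1)[0]
        if wanted in dictionary_part.lower():
            hits.append(number)
    return hits


def raw_stream(data: bytes, object_number: int):
    """Return the undecoded stream bytes of an object, or None if there is no stream."""
    for number, _generation, body in iter_objects(data):
        if number == object_number:
            match = _STREAM.search(body)
            return match.group(1) if match else None
    return None
```

The bytes returned by raw_stream are still encoded with whatever filters the object declares, so they normally need decoding before they are readable.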

Step 3: Interactive Analysis with peepdf

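Invoke peepdf in interactive shell mode with the -i option (without it, peepdf prints a summary of the file and exits):

peepdf -i suspicious.pdf

If the file is malformed and peepdf refuses to parse it, adding -f (force mode) and -l (loose mode) usually gets past the errors.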

Inside the peepdf shell, you can display the logical tree of all objects in the document:

> tree

To inspect a specific object (e.g., object 7):

> info 7

To dump and decode suspicious streams:

> stream 7

peepdf can automatically decode common encodings (FlateDecode, ASCIIHexDecode, etc.), making it easier to analyze obfuscated content.
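For intuition about what this decoding involves, the following sketch implements the two filters named above with the Python standard library and applies them in the order a stream's /Filter array lists them. It is a minimal illustration, not peepdf's code: the function names are hypothetical, only these two filters are supported, and /DecodeParms predictors are ignored.

```python
import binascii
import zlib


def ascii_hex_decode(data: bytes) -> bytes:
    """Decode /ASCIIHexDecode: hex digits, whitespace ignored, '>' marks the end."""
    hex_part = data.split(b">", 1)[0]
    digits = b"".join(hex_part.split())
    if len(digits) % 2:
        digits += b"0"  # a missing final digit is treated as 0
    return binascii.unhexlify(digits)


def flate_decode(data: bytes) -> bytes:
    """Decode /FlateDecode, which is zlib-wrapped deflate."""
    return zlib.decompress(data)


DECODERS = {
    "/ASCIIHexDecode": ascii_hex_decode,
    "/FlateDecode": flate_decode,
}


def decode_stream(data: bytes, filters: list) -> bytes:
    """Apply each filter in the order given by the stream's /Filter entry."""
    for name in filters:
        if name not in DECODERS:
            raise ValueError(f"unsupported filter: {name}")
        data = DECODERS[name](data)
    return data
```

Attackers often chain several filters to hide content from naive scanners, which is why the order of the /Filter array matters: a stream declared as [/ASCIIHexDecode /FlateDecode] must be hex-decoded first and inflated second.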

Step 4: Extract and Analyze Embedded Content

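When you encounter objects with embedded JavaScript or files, extract them. To print the JavaScript code found in an object (for example, object 10):

> js_code 10

If the code hides its payload in escaped strings such as %u9090%u9090, the js_unescape command decodes them. To save an embedded file, redirect the decoded stream of its object to disk (here object 15; the second > is peepdf's output redirection, not a prompt):

> stream 15 > embedded_15.bin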

The extracted artifacts can be further analyzed with external tools (e.g., running JavaScript in a controlled emulator, or examining shellcode with disassemblers).
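Outside the peepdf shell, the unescaping that js_unescape performs on escaped strings can be approximated with a short helper, which is useful when the JavaScript has already been saved to a file. This is a sketch under the assumption that the payload uses the common %uXXXX and %XX escape forms. The function name is hypothetical, and characters that are not part of an escape sequence are simply skipped.

```python
import re

_ESCAPE = re.compile(r"%u([0-9A-Fa-f]{4})|%([0-9A-Fa-f]{2})")


def unescape_to_bytes(escaped: str) -> bytes:
    """Turn %uXXXX and %XX escape sequences into raw bytes."""
    output = bytearray()
    for match in _ESCAPE.finditer(escaped):
        if match.group(1) is not None:
            # %uXXXX is a 16-bit code unit; shellcode relies on it being laid
            # out little-endian in memory, so the low byte is emitted first.
            output += int(match.group(1), 16).to_bytes(2, "little")
        else:
            output.append(int(match.group(2), 16))
    return bytes(output)
```

The resulting bytes can be written to a file and loaded into a disassembler or a shellcode emulator.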

Step 5: Optional - Emulate JavaScript and Analyze Shellcode

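If a V8 binding such as PyV8 is installed, peepdf's js_analyse command can run the extracted JavaScript inside an emulated engine, helping you understand its behavior without opening the document in a real PDF reader. Because the script does execute inside that engine, keep this step inside the isolated analysis environment. If Pylibemu is installed, the sctest command emulates recovered shellcode and reports the API calls it makes, which helps identify its type and intent.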

When to Use It

This skill is particularly useful in the following scenarios:

  • Triaging suspicious PDF attachments received via phishing emails
  • Investigating malware campaigns leveraging PDF exploits
  • Extracting and analyzing embedded JavaScript or shellcode for threat intelligence
  • Forensic examination of weaponized documents in incident response cases
  • Developing custom detection signatures for PDF-based malware
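For the threat-intelligence and signature-development scenarios, extracted artifacts are typically reduced to a handful of indicators. The sketch below shows one simple way to do that in Python: hash the artifact and pull out any URLs it contains. The function name and the URL pattern are illustrative assumptions, and real samples often need deobfuscation before URLs become visible.

```python
import hashlib
import re

# Deliberately loose URL pattern; stops at whitespace, quotes, and brackets.
_URL = re.compile(rb"https?://[^\s\"'<>()\\]+", re.IGNORECASE)


def collect_iocs(artifact: bytes) -> dict:
    """Return the SHA-256 of an extracted artifact and the unique URLs found in it."""
    urls = {match.decode("ascii", "replace") for match in _URL.findall(artifact)}
    return {
        "sha256": hashlib.sha256(artifact).hexdigest(),
        "urls": sorted(urls),
    }
```

Recording the hash of the original sample alongside the hashes of extracted artifacts also makes it straightforward to show later that the evidence was not altered during analysis.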

Important Notes

  • Isolated Environment: Always perform PDF malware analysis in a sandbox or virtual machine to prevent accidental execution of malicious code.
  • Legal Considerations: Only analyze PDFs you are authorized to handle. Handling live malware may have legal and ethical implications.
  • Tool Updates: Keep peepdf and companion tools updated to handle evolving PDF formats and obfuscation techniques.
  • Limitations: While static analysis is powerful, some advanced threats may use encryption or multi-stage payloads that require dynamic analysis for full understanding.

By following this structured approach, you can safely and effectively analyze suspicious PDFs, extract malicious artifacts, and support broader security operations. This skill is a core component of modern malware analysis and digital forensics workflows.