How to Use PDFjs to Parse PDF Documents in Angular

In this post, I’ll show you how to parse a PDF document in Angular.

We will extract the full text content of a PDF file using the pdfjs-dist library.

Generate a new Angular application, if you do not have one already:

ng new PdfReader

Install the latest version of pdfjs-dist from npm:

npm install pdfjs-dist

Let’s create a service that will contain the logic to parse the PDF:

ng generate service PdfReader

Replace the contents of the generated service with the following:

import { Injectable } from '@angular/core';
import * as pdfjsLib from 'pdfjs-dist';

@Injectable({
  providedIn: 'root'
})
export class PdfReaderService {
  constructor() {
    // The worker script should come from the same pdf.js version as the installed pdfjs-dist package.
    pdfjsLib.GlobalWorkerOptions.workerSrc = '//mozilla.github.io/pdf.js/build/pdf.worker.js';
  }

  public async readPdf(pdfUrl: string): Promise<string> {
    // getDocument returns a loading task; the document itself comes from its promise.
    const pdf = await pdfjsLib.getDocument(pdfUrl).promise;
    const pageTexts: string[] = []; // text of each page, in page order
    for (let i = 1; i <= pdf.numPages; i++) {
      const page = await pdf.getPage(i);
      const textContent = await page.getTextContent();
      // Items may be TextItem or TextMarkedContent; only TextItem has a str property.
      pageTexts.push(textContent.items.map((s) => ('str' in s ? s.str : '')).join(''));
    }
    return pageTexts.join('');
  }
}

Our service configures the worker source of pdfjs-dist.
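The worker script generally needs to come from the same pdf.js release as the installed pdfjs-dist package, otherwise pdf.js reports an API/worker version mismatch. As a rough sketch, the worker URL can be derived from a version string instead of being hard-coded. The helper name workerUrlFor is hypothetical, and the unpkg URL layout and the switch to .mjs file names from version 4 onwards are assumptions to verify against the version you actually install:

```typescript
// Hypothetical helper (not part of pdfjs-dist): builds a CDN URL for the
// worker script that matches a given pdfjs-dist version string.
// Assumptions: the unpkg layout "pdfjs-dist@<version>/build/<file>", and that
// builds from major version 4 onwards ship the worker as an .mjs file.
function workerUrlFor(version: string): string {
  const major = Number.parseInt(version.split('.')[0], 10);
  if (Number.isNaN(major)) {
    throw new Error(`Unrecognised pdfjs-dist version: ${version}`);
  }
  const file = major >= 4 ? 'pdf.worker.min.mjs' : 'pdf.worker.min.js';
  return `https://unpkg.com/pdfjs-dist@${version}/build/${file}`;
}
```

In the service constructor this could be used as pdfjsLib.GlobalWorkerOptions.workerSrc = workerUrlFor(pdfjsLib.version), so the worker follows whatever version npm installed.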

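Given a URL pointing to a PDF document, the readPdf method loads the file, reads the text content of each page in turn, and returns the text of all pages joined into a single string.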
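The text-joining step can be looked at in isolation. The sketch below uses simplified stand-in types rather than the real pdfjs-dist ones (TextItemLike, MarkedContentLike, pageItemsToText and joinPages are all hypothetical names), and shows one way to skip items that carry no str property. Unlike the service, it joins pages with a newline, which is a design choice to keep words at page boundaries from running together:

```typescript
// Simplified stand-ins for the item types returned by getTextContent()
// (assumed shapes, for illustration only).
interface TextItemLike { str: string; }
interface MarkedContentLike { type: string; }
type PageItem = TextItemLike | MarkedContentLike;

// Type guard: only text items carry a `str` property.
function isTextItem(item: PageItem): item is TextItemLike {
  return 'str' in item;
}

// Joins the text items of one page, skipping marked-content entries.
function pageItemsToText(items: PageItem[]): string {
  return items.filter(isTextItem).map((item) => item.str).join('');
}

// Joins the per-page strings with a newline between pages.
function joinPages(pages: string[]): string {
  return pages.join('\n');
}
```

With the real library, the same guard could be applied to textContent.items inside the page loop.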

For demonstration, I am using app.component.ts to read the contents of the PDF file:

import { Component, OnInit } from '@angular/core';
import { PdfReaderService } from './pdf-reader.service';

@Component({
  selector: 'app-root',
  templateUrl: './app.component.html',
  styleUrls: ['./app.component.css']
})
export class AppComponent implements OnInit {
  constructor(private pdfReader: PdfReaderService) { }

  ngOnInit() {
    this.pdfReader.readPdf('./assets/sample.pdf')
      .then(text => alert('PDF parsed: ' + text), reason => console.error(reason));
  }
}

The example above assumes a PDF document called sample.pdf under src/assets.

Run the application with ng serve, and the file contents appear in an alert dialog.

Umut Esen

I am a software developer and blogger with a passion for the world wide web.

This Post Has 8 Comments

  1. Artyom

    Don’t know when this was written, but I found an error: pdfjsLib.getDocument returns a PDFDocumentLoadingTask, and the document has to be obtained from its promise. So it should be something like
    const pdfLoadingTask = pdfjsLib.getDocument(pdfUrl);
    pdfLoadingTask.promise.then((pdf) => {
    for(let i = 1; i <= pdf._pdfInfo.numPages; i++) {
    ….

    or
    const pdf = await pdfjsLib.getDocument(pdfUrl).promise;
    The .promise was missing in your code.

    Anyways great article and great example. Thanks for showing this.

  2. Sajan

    Thank you for the detailed tutorial.

    I am facing an issue with:
    countPromises.push(textContent.items.map((s) => s.str).join(''));
    's.str' throws an error: "Property 'str' does not exist on type 'TextItem | TextMarkedContent'.
    Property 'str' does not exist on type 'TextMarkedContent'."

    Any help would be appreciated.

    1. Artyom

      Hey, Sajan.
      I was able to fix this.
      As I investigated, each entry of textContent.items is of type TextItem | TextMarkedContent, so if we use (s as TextItem).str instead of just s.str, everything compiles and runs.

  3. Ciprian

    Hi,
    I tried to use the code in the article, and ng serve throws an error related to _pdfInfo and getPage, saying that they do not exist on type PDFDocumentLoadingTask.

    Could you give any suggestions on how to fix that?

    Thanks

    1. c

      use pdf._worker.getPage() to fix the issue

    2. Artyom

      As I wrote in my comment, you have to add .promise in the PdfReaderService class, so that you have something like this:

      const pdf = await pdfjsLib.getDocument(pdfUrl).promise;

      With that, everything works like a charm.

  4. Igor

    Hi! How can I get the file in the form?

    1. Umut Esen

      You would need to save it somewhere accessible via a URL, for example to a backend server. Then pass the URL to pdfjs to read, good luck!