In this post, I’ll show you how to parse a PDF document in Angular.
We will extract the full text content of a PDF file using the pdfjs-dist library.
Generate a new Angular application, if you do not have one already:
ng new PdfReader
Install the latest version of pdfjs-dist from npm:
npm install pdfjs-dist
Let’s create a service that will contain the PDF-parsing logic:
ng generate service PdfReader
Replace the contents of the generated service with the following:
import { Injectable } from '@angular/core';
import * as pdfjsLib from 'pdfjs-dist';

@Injectable({
  providedIn: 'root'
})
export class PdfReaderService {
  constructor() {
    // The worker build should come from the same release as the installed pdfjs-dist.
    pdfjsLib.GlobalWorkerOptions.workerSrc = '//mozilla.github.io/pdf.js/build/pdf.worker.js';
  }

  public async readPdf(pdfUrl: string): Promise<string> {
    // getDocument() returns a loading task; the document itself is behind its promise.
    const pdf = await pdfjsLib.getDocument(pdfUrl).promise;
    const pageContents: string[] = []; // text of each page, in page order
    for (let i = 1; i <= pdf.numPages; i++) {
      const page = await pdf.getPage(i);
      const textContent = await page.getTextContent();
      // Items can be TextItem or TextMarkedContent; only TextItem carries `str`.
      pageContents.push(textContent.items.map((s) => ('str' in s ? s.str : '')).join(''));
    }
    return pageContents.join('');
  }
}
Our service configures the worker source of pdfjs-dist in its constructor.
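One thing to watch with a hard-coded worker URL is that pdf.js expects the worker to come from the same release as the library itself. Here is a minimal sketch of one way to keep the two aligned. `buildWorkerSrc` is a hypothetical helper, not part of pdfjs-dist, and both the unpkg CDN layout and the default worker file name are assumptions (newer releases ship `.mjs` builds, which is why the file name is a parameter):

```typescript
// Hypothetical helper: builds a worker URL pinned to one pdfjs-dist release.
// Assumption: the CDN serves the npm package as
// https://unpkg.com/pdfjs-dist@<version>/build/<file>.
function buildWorkerSrc(version: string, workerFile: string = 'pdf.worker.min.js'): string {
  // Reject anything that does not start like a semantic version (e.g. "3.11.174").
  if (!/^\d+\.\d+\.\d+/.test(version)) {
    throw new Error(`Unexpected pdfjs-dist version string: ${version}`);
  }
  return `https://unpkg.com/pdfjs-dist@${version}/build/${workerFile}`;
}
```

In the service constructor this could be used as `pdfjsLib.GlobalWorkerOptions.workerSrc = buildWorkerSrc(pdfjsLib.version);`, since pdfjs-dist exports the version string of the installed library.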
Given a URL pointing to a PDF document, the readPdf method retrieves the file, reads the text content of each page in turn, and returns everything as a single string.
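Because readPdf joins the pages with an empty string, the last word of one page runs straight into the first word of the next. If you need to tell the pages apart, a variation that returns one string per page may help. The sketch below is written against small structural interfaces (`DocumentLike`, `PageLike` and the item types are stand-ins I made up for illustration, covering only the members used here), and `extractPageTexts` is a hypothetical helper name:

```typescript
// Structural stand-ins for the pdf.js document and page objects.
// Assumption: only the members used below; the real types come from pdfjs-dist.
type TextItemLike = { str: string };
type MarkedContentLike = { type: string };

interface PageLike {
  getTextContent(): Promise<{ items: Array<TextItemLike | MarkedContentLike> }>;
}

interface DocumentLike {
  numPages: number;
  getPage(pageNumber: number): Promise<PageLike>;
}

// Hypothetical helper: returns one string per page so the caller picks the separator.
async function extractPageTexts(pdf: DocumentLike): Promise<string[]> {
  const pages: string[] = [];
  // pdf.js page numbers start at 1, not 0.
  for (let pageNumber = 1; pageNumber <= pdf.numPages; pageNumber++) {
    const page = await pdf.getPage(pageNumber);
    const content = await page.getTextContent();
    // Items without `str` (marked-content markers) contribute no text.
    pages.push(content.items.map((item) => ('str' in item ? item.str : '')).join(''));
  }
  return pages;
}
```

A caller could then write `(await extractPageTexts(pdf)).join('\n')` to put each page on its own line.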
For demonstration, I am using app.component.ts to read the contents of the PDF file:
import { Component, OnInit } from '@angular/core';
import { PdfReaderService } from './pdf-reader.service';

@Component({
  selector: 'app-root',
  templateUrl: './app.component.html',
  styleUrls: ['./app.component.css']
})
export class AppComponent implements OnInit {
  constructor(private pdfReader: PdfReaderService) { }

  ngOnInit() {
    this.pdfReader.readPdf('./assets/sample.pdf')
      .then(text => alert('PDF parsed: ' + text), reason => console.error(reason));
  }
}
In the example above, I have a sample PDF document under src/assets, called sample.pdf.
Run the application with ng serve, and the file contents appear in an alert dialog.
I don’t know when this was written, but I found an error: pdfjsLib.getDocument returns a PDFDocumentLoadingTask, and the document has to be taken from that task's promise. So it should be something like:
const pdfLoadingTask = pdfjsLib.getDocument(pdfUrl);
pdfLoadingTask.promise.then((pdf) => {
for(let i = 1; i <= pdf._pdfInfo.numPages; i++) {
….
or
const pdf = await pdfjsLib.getDocument(pdfUrl).promise;
The .promise part was missed in your code.
Anyway, great article and great example. Thanks for showing this.
Thank you for the detailed tutorial.
I am facing an issue with this line:
countPromises.push(textContent.items.map((s) => s.str).join(''));
s.str throws an error: "Property 'str' does not exist on type 'TextItem | TextMarkedContent'. Property 'str' does not exist on type 'TextMarkedContent'."
Any help would be appreciated.
Hey, Sajan.
I was able to fix this.
When I investigated, I found that each element of textContent.items is of type TextItem | TextMarkedContent, so if we use (s as TextItem).str instead of just s.str, everything compiles and runs.
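A cast like this silences the compiler without checking anything at runtime. As an alternative, a user-defined type guard narrows the union explicitly. The sketch below declares local stand-ins for the two item types named in the error message (reduced to the fields needed here; in a real project they would come from the pdfjs-dist typings), and `isTextItem` and `itemsToText` are hypothetical helper names:

```typescript
// Local stand-ins for the two item shapes named in the compiler error.
// Assumption: only the fields needed here; pdfjs-dist declares more.
interface TextItem {
  str: string;
}
interface TextMarkedContent {
  type: string;
}

// User-defined type guard: tells the compiler which union member we have.
function isTextItem(item: TextItem | TextMarkedContent): item is TextItem {
  return 'str' in item;
}

// Keeps only real text items, then concatenates their strings.
function itemsToText(items: Array<TextItem | TextMarkedContent>): string {
  return items.filter(isTextItem).map((item) => item.str).join('');
}
```

With this in place, the line in the service could become `countPromises.push(itemsToText(textContent.items));`.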
Hi,
I tried to use the code in the article, and ng serve throws an error related to _pdfInfo and getPage, saying that they do not exist on type PDFDocumentLoadingTask.
Could you give any suggestions on how to fix that?
Thanks
Use pdf._worker.getPage() to fix the issue.
As I wrote in my comment, you have to add .promise in the PdfReaderService class, so that you have something like this:
const pdf = await pdfjsLib.getDocument(pdfUrl).promise;
With that, everything works like a charm.
Hi! How can I read a file that the user selects in a form?
You would need to save it somewhere accessible via a URL, for example on a backend server. Then pass the URL to pdfjs to read it. Good luck!
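As an alternative to uploading the file first, getDocument also accepts the raw bytes of a PDF through a `data` property, so a file picked in the browser can be read in memory. Here is a minimal sketch of that idea. To keep it independent of DOM and pdfjs-dist typings it uses structural stand-ins: `BinarySource` (which a browser File or Blob satisfies), `GetDocumentLike`, and the helper name `loadPdfFromFile` are all hypothetical names introduced for illustration:

```typescript
// Anything that can hand over its bytes; a browser File or Blob fits this shape.
interface BinarySource {
  arrayBuffer(): Promise<ArrayBuffer>;
}

// Shape of the pdf.js entry point as used here: takes { data } and
// returns a loading task whose promise resolves to the document.
type GetDocumentLike<TDoc> = (source: { data: Uint8Array }) => { promise: Promise<TDoc> };

// Hypothetical helper: reads the selected file into memory and passes the bytes on.
async function loadPdfFromFile<TDoc>(
  file: BinarySource,
  getDocument: GetDocumentLike<TDoc>
): Promise<TDoc> {
  const data = new Uint8Array(await file.arrayBuffer());
  return getDocument({ data }).promise;
}
```

In a component, this might be called from a file input's change handler as `loadPdfFromFile(input.files[0], (src) => pdfjsLib.getDocument(src))`, after which the pages can be read exactly as in readPdf.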