{"id":249,"date":"2017-11-23T04:44:38","date_gmt":"2017-11-23T04:44:38","guid":{"rendered":"http:\/\/www.yohz.com\/blogs\/?p=249"},"modified":"2019-09-05T10:26:18","modified_gmt":"2019-09-05T02:26:18","slug":"compressing-pdfs-in-your-database","status":"publish","type":"post","link":"https:\/\/www.yohz.com\/blogs\/2017\/11\/23\/compressing-pdfs-in-your-database\/","title":{"rendered":"Compressing PDFs in your database"},"content":{"rendered":"<p><em>\u2192 This article refers to SQL Blob Viewer, which has now been renamed to <a href=\"http:\/\/www.yohz.com\/siv8_details.htm\">SQL Image Viewer<\/a>.\u00a0 The techniques described in this blog is still applicable, as the functionality of the product remains the same.\u00a0 Only the name has changed.<\/em><\/p>\n<p>Recently, a user wanted to compress PDFs stored in his database, in order to reduce the overall size of the database.\u00a0 He asked if we had any application that could do this.\u00a0 Unfortunately, we don&#8217;t, but it got me to exploring the available options.<\/p>\n<p>Turns out that PDF software development kits aren&#8217;t cheap at all.\u00a0 Licensing can run into thousands of dollars, which isn&#8217;t feasible for us.\u00a0 Open-source software is another option, which is what I finally went with.\u00a0 In this case, I used Ghostscript, an all-purpose PDF toolkit, available at\u00a0<a href=\"https:\/\/www.ghostscript.com\/\">https:\/\/www.ghostscript.com\/<\/a>.<\/p>\n<p>There are 3 steps to compressing PDFs in your database &#8211; extracting the PDFs, compressing or optimizing them, and finally uploading them back into the database.\u00a0 We will use <a href=\"http:\/\/www.yohz.com\/sbv_details.htm\">SQL Blob Viewer<\/a> to first extract the PDFs, then Ghostscript to reduce the PDF size, and finally <a href=\"http:\/\/www.yohz.com\/sfi_overview.htm\">SQL File Import<\/a> to upload the PDFs back into the database.\u00a0 For reference purposes, these PDFs were created from document scans, so they have a 600 dpi resolution and are not optimized for PDF storage.\u00a0 We&#8217;re running this example on Windows, but there is also a Linux version of Ghostscript, and both SQL Blob Viewer and SQL File Import will run on Linux using Wine.<\/p>\n<p>&nbsp;<\/p>\n<p><strong>Extracting the PDFs<\/strong><\/p>\n<p>Extracting PDFs from your database using <a href=\"http:\/\/www.yohz.com\/sbv_details.htm\">SQL Blob Viewer<\/a> is very simple &#8211; first write the SQL command to\u00a0retrieve the PDFs.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-252\" src=\"http:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs_01.png\" alt=\"\" width=\"800\" height=\"600\" srcset=\"https:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs_01.png 800w, https:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs_01-300x225.png 300w, https:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs_01-768x576.png 768w, https:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs_01-150x113.png 150w\" sizes=\"(max-width: 800px) 100vw, 800px\" \/><\/p>\n<p>We then export the PDF files to disk, using the primary key value in the <em>ID<\/em> field to name the exported files.\u00a0 We do this so that when we upload the compressed files, we can use the <em>ID<\/em> value to update the correct rows.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-253\" src=\"http:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs_02.png\" alt=\"\" width=\"746\" height=\"414\" srcset=\"https:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs_02.png 746w, https:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs_02-300x166.png 300w, https:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs_02-150x83.png 150w\" sizes=\"(max-width: 746px) 100vw, 746px\" \/><\/p>\n<p>If you have a lot of PDFs to export, you should choose to retrieve only the first few rows, to avoid loading the entire data set into memory.\u00a0 After that, when you export the result set, the entire result set will be exported.\u00a0 See this <a href=\"http:\/\/www.yohz.com\/sbv_faq_images.htm\">page <\/a>for details on how to export large result sets with SQL Blob Viewer.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-266\" src=\"http:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs03-1.png\" alt=\"\" width=\"800\" height=\"263\" srcset=\"https:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs03-1.png 800w, https:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs03-1-300x99.png 300w, https:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs03-1-768x252.png 768w, https:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs03-1-150x49.png 150w\" sizes=\"(max-width: 800px) 100vw, 800px\" \/><\/p>\n<p>&nbsp;<\/p>\n<p><strong>Compressing\u00a0the PDFs<\/strong><\/p>\n<p>Now that we&#8217;ve exported the files, it&#8217;s time to use <a href=\"https:\/\/www.ghostscript.com\/\">Ghostscript <\/a>to compress the images found in those PDFs.<\/p>\n<p>The easiest way to do this is to reduce the resolution of the images.\u00a0 You can do this using the <strong>PDFSETTINGS<\/strong> option.\u00a0 The possible values are:<\/p>\n<ul>\n<li>\/screen &#8211; converts to 72 dpi<\/li>\n<li>\/ebook &#8211; converts to 150 dpi<\/li>\n<li>\/printer &#8211; converts to 300 dpi<\/li>\n<li>\/prepress &#8211; converts to 300 dpi, color preserving<\/li>\n<\/ul>\n<p>Depending on your requirements, you might want to test the various options to see which best suits your needs.\u00a0 I took one of the exported PDFs, and converted them using each of the 4 options.\u00a0 As you can see, the size of the PDF drops dramatically for all 4 options.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-262\" src=\"http:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs09.png\" alt=\"\" width=\"897\" height=\"427\" srcset=\"https:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs09.png 897w, https:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs09-300x143.png 300w, https:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs09-768x366.png 768w, https:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs09-150x71.png 150w\" sizes=\"(max-width: 897px) 100vw, 897px\" \/><\/p>\n<p>Here is the DOS batch script I used to convert the PDFs using the \/prepress option (<strong>NOTE<\/strong>: Ghostscript options are case-sensitive, so you cannot for e.g write -PDFSETTINGS as -PDFSettings):<\/p>\n<p style=\"padding-left: 30px;\">for %%x in (*.pdf)\u00a0 do gswin64c.exe -sDEVICE=pdfwrite -dPDFSETTINGS=\/prepress -dBATCH -dNOPAUSE -dQUIET -SOutputFile=&#8221;%%~nx_compressed.pdf&#8221; %%x<\/p>\n<p>The options used are:<\/p>\n<ul>\n<li><strong>-sDEVICE<\/strong>=pdfwrite &#8211; this tells Ghostscript that we want to create a PDF file<\/li>\n<li>&#8211;<strong>dPDFSETTINGS<\/strong>=\/prepress &#8211; this tells Ghostscript to convert all images found in the source PDF to 300 dpi resolution<\/li>\n<li><strong>-dBATCH -dNOPAUSE -dQUIET<\/strong> &#8211; these options indicate that the process should run non-interactively<\/li>\n<li><strong>-SOutputFile<\/strong>=&#8217;%%~nx_compressed.pdf&#8217; &#8211; this tells Ghostscript how to name the output file.\u00a0 Since we want to add a <em>_compressed<\/em> suffix, we first use the ~nx option to extract just the source file name without the extension, add the <em>_compressed<\/em> suffix, followed by the <em>.pdf<\/em> extension.<\/li>\n<li>%%x &#8211; this is the source file name that matches the search pattern in the <em>for %%x in (*.pdf)<\/em> loop<\/li>\n<\/ul>\n<p>Basically, this loops through all the PDFs in the current folder, converts all images in each PDF to 300 dpi, and saves the PDF with the <em>_compressed<\/em> suffix.<\/p>\n<p>As you can see, the new PDFs are significantly smaller than the original PDFs.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-263\" src=\"http:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs04-1.png\" alt=\"\" width=\"897\" height=\"594\" srcset=\"https:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs04-1.png 897w, https:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs04-1-300x199.png 300w, https:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs04-1-768x509.png 768w, https:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs04-1-150x99.png 150w\" sizes=\"(max-width: 897px) 100vw, 897px\" \/><\/p>\n<p>&nbsp;<\/p>\n<p><strong>Updating the database<\/strong><\/p>\n<p>Now, we need to update the existing record with the optimized PDF file.\u00a0 We can do this using <a href=\"http:\/\/www.yohz.com\/sfi_overview.htm\">SQL File Import<\/a>.\u00a0 First, we enter the search pattern for the files we want to use i.e. those with the <em>_compressed<\/em> suffix.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-257\" src=\"http:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs05.png\" alt=\"\" width=\"680\" height=\"680\" srcset=\"https:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs05.png 680w, https:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs05-150x150.png 150w, https:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs05-300x300.png 300w\" sizes=\"(max-width: 680px) 100vw, 680px\" \/><\/p>\n<p>Next, we need to map the columns.\u00a0 Using the file name as the input value for the <em>ID<\/em> column, we need to:<\/p>\n<ul>\n<li>extract the ID value from the file name<\/li>\n<li>indicate that this value is a key field<\/li>\n<li>indicate that this is an update process<\/li>\n<\/ul>\n<p>We do this via the following script:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-260\" src=\"http:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs06-1.png\" alt=\"\" width=\"824\" height=\"782\" srcset=\"https:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs06-1.png 824w, https:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs06-1-300x285.png 300w, https:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs06-1-768x729.png 768w, https:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs06-1-150x142.png 150w\" sizes=\"(max-width: 824px) 100vw, 824px\" \/><\/p>\n<p>For the <em>attachment <\/em>column, we simply indicate to <a href=\"http:\/\/www.yohz.com\/sfi_overview.htm\">SQL File Import<\/a> that we want to use the file contents.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-259\" src=\"http:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs07.png\" alt=\"\" width=\"824\" height=\"782\" srcset=\"https:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs07.png 824w, https:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs07-300x285.png 300w, https:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs07-768x729.png 768w, https:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs07-150x142.png 150w\" sizes=\"(max-width: 824px) 100vw, 824px\" \/><\/p>\n<p>Internally, <a href=\"http:\/\/www.yohz.com\/sfi_overview.htm\">SQL File Import<\/a> will form the following UPDATE statement based on our script as follows:<\/p>\n<p style=\"padding-left: 30px;\">UPDATE attachments SET attachment = :attachment WHERE ID = :ID<\/p>\n<p>The test script shows that we have extracted the <em>ID<\/em> value correctly, and that the <em>attachment<\/em> column will use the contents of the files.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-261\" src=\"http:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs08.png\" alt=\"\" width=\"824\" height=\"680\" srcset=\"https:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs08.png 824w, https:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs08-300x248.png 300w, https:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs08-768x634.png 768w, https:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs08-150x124.png 150w\" sizes=\"(max-width: 824px) 100vw, 824px\" \/><\/p>\n<p>Now, we just need to run the script in <a href=\"http:\/\/www.yohz.com\/sfi_overview.htm\">SQL File Import<\/a>, and our records are updated with compressed versions of the PDF files.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-256\" src=\"http:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs10.png\" alt=\"\" width=\"800\" height=\"600\" srcset=\"https:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs10.png 800w, https:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs10-300x225.png 300w, https:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs10-768x576.png 768w, https:\/\/www.yohz.com\/blogs\/wp-content\/uploads\/2017\/11\/gs10-150x113.png 150w\" sizes=\"(max-width: 800px) 100vw, 800px\" \/><\/p>\n<p>That is basically all you need to do if you want to reduce the size of the PDFs in your database.\u00a0 The steps are similar if you want to process any of your blob data and update them in your database e.g.<\/p>\n<ul>\n<li>resize images<\/li>\n<li>compress files into archive (zip) files<\/li>\n<li>process images e.g. add watermarks, convert to grayscale etc<\/li>\n<\/ul>\n<p><a href=\"http:\/\/www.yohz.com\/sbv_details.htm\">SQL Blob Viewer<\/a> and <a href=\"http:\/\/www.yohz.com\/sfi_overview.htm\">SQL File Import<\/a> will handle the extraction and update process respectively.\u00a0 You are free to use any external tools to process your images\/files.<\/p>\n<div class=\"fcbkbttn_buttons_block\" id=\"fcbkbttn_left\"><div class=\"fcbkbttn_like \"><fb:like href=\"https:\/\/www.yohz.com\/blogs\/2017\/11\/23\/compressing-pdfs-in-your-database\/\" action=\"like\" colorscheme=\"light\" layout=\"button\"  size=\"small\"><\/fb:like><\/div><div class=\"fb-share-button  \" data-href=\"https:\/\/www.yohz.com\/blogs\/2017\/11\/23\/compressing-pdfs-in-your-database\/\" data-type=\"button\" data-size=\"small\"><\/div><\/div>","protected":false},"excerpt":{"rendered":"<p>\u2192 This article refers to SQL Blob Viewer, which has now been renamed to SQL Image Viewer.\u00a0 The techniques described in this blog is still applicable, as the functionality of the product remains the same.\u00a0 Only the name has changed. Recently, a user wanted to compress PDFs stored in his database, in order to reduce [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[5,40,11],"tags":[16,7,41],"_links":{"self":[{"href":"https:\/\/www.yohz.com\/blogs\/wp-json\/wp\/v2\/posts\/249"}],"collection":[{"href":"https:\/\/www.yohz.com\/blogs\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.yohz.com\/blogs\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.yohz.com\/blogs\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.yohz.com\/blogs\/wp-json\/wp\/v2\/comments?post=249"}],"version-history":[{"count":8,"href":"https:\/\/www.yohz.com\/blogs\/wp-json\/wp\/v2\/posts\/249\/revisions"}],"predecessor-version":[{"id":513,"href":"https:\/\/www.yohz.com\/blogs\/wp-json\/wp\/v2\/posts\/249\/revisions\/513"}],"wp:attachment":[{"href":"https:\/\/www.yohz.com\/blogs\/wp-json\/wp\/v2\/media?parent=249"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.yohz.com\/blogs\/wp-json\/wp\/v2\/categories?post=249"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.yohz.com\/blogs\/wp-json\/wp\/v2\/tags?post=249"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}