Word: cutting a Word file

Cut a file and save it in several documents

Cut a Word file into several PDFs…

Initial Request

The initial request which led me to reflect on this subject came from a colleague.

My colleague receives a large Word document from his ERP containing on each page a letter for a different recipient.

To integrate these letters into the EDM (Electronic Document Management) system, he needed to:

  • Cut the large file into several files (1 page = 1 file)
  • Retrieve data from documents to name the PDF
  • Save new documents in PDF

Not knowing much about manipulating Word via PowerShell, I broke the problem down, as usual, into small steps.

Processing the request

Opening and cutting the file

The first step is to open the Word document in question. To do this, we instantiate a Word object and open the document:

$word = New-Object -ComObject word.application
$word.Visible = $True
$doc = $word.Documents.Open($inputFile)

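I leave $word.Visible at $True while debugging, then set it to $False so that the processing runs entirely in the background.

$inputFile is a variable that contains the path to the Word file to process (use a full path, since Word runs as a separate process and does not share PowerShell's current directory).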
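One thing the snippet above does not show is shutting Word down again: every New-Object call starts a WINWORD.EXE process that stays alive until it is told to quit. Here is a minimal sketch of one way to handle that, assuming a helper I have named Invoke-WithWord (a hypothetical name, not part of the original script) that runs a script block and always quits Word afterwards:

```powershell
# Hypothetical helper (not part of the original script): runs a script block
# against a hidden Word instance and always quits Word afterwards.
function Invoke-WithWord {
    param (
        [Parameter(Mandatory = $true)][scriptblock]$Action
    )

    $word = New-Object -ComObject Word.Application
    try {
        $word.Visible = $false
        # The script block receives the Word application as its first argument.
        & $Action $word
    }
    finally {
        # Quit Word and release the COM reference so that no WINWORD.EXE
        # process is left running in the background.
        $word.Quit()
        [void][System.Runtime.InteropServices.Marshal]::ReleaseComObject($word)
    }
}

# Example call (commented out; the path is a placeholder):
# Invoke-WithWord { param($word) $doc = $word.Documents.Open('C:\Temp\letters.docx'); $doc.Close([ref]$false) }
```

The script block is expected to close the documents it opens, so that Word has nothing left to ask about when it quits.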

Once the document is open, we get to the heart of the matter: how do we split this document of X pages into X one-page documents?

I searched the net for a long time before finding a solution which is perhaps not the most elegant, but which has the merit of working ;-).

The principle is quite simple in fact: we take one page after another of the document and we copy / paste it into a new document that we save.

$pageLength = 1 # Number of pages per output document
$pages = $doc.ComputeStatistics([Microsoft.Office.Interop.Word.WdStatistic]::wdStatisticPages)
$rngPage = $doc.Range()

for ($i = 1; $i -le $pages; $i += $pageLength)
{
    # Jump to the starting page and mark the beginning of the range
    [Void]$word.Selection.GoTo([Microsoft.Office.Interop.Word.WdGoToItem]::wdGoToPage,
        [Microsoft.Office.Interop.Word.WdGoToDirection]::wdGoToAbsolute,
        $i)
    $rngPage.Start = $word.Selection.Start

    if ($i + $pageLength -le $pages) {
        # Jump to the next page number and mark the end of the range
        [Void]$word.Selection.GoTo([Microsoft.Office.Interop.Word.WdGoToItem]::wdGoToPage,
            [Microsoft.Office.Interop.Word.WdGoToDirection]::wdGoToAbsolute,
            $i + $pageLength)
        $rngPage.End = $word.Selection.Start
    }
    else {
        # Last chunk: there is no next page, so the range runs to the end of the document
        $rngPage.End = $doc.Content.End
    }

    # Remember the margins of the source document
    $marginTop = $word.Selection.PageSetup.TopMargin
    $marginBottom = $word.Selection.PageSetup.BottomMargin
    $marginLeft = $word.Selection.PageSetup.LeftMargin
    $marginRight = $word.Selection.PageSetup.RightMargin

    $rngPage.Copy()
    $newDoc = $word.Documents.Add() # The new document becomes the active one

    $word.Selection.PageSetup.TopMargin = $marginTop
    $word.Selection.PageSetup.BottomMargin = $marginBottom
    $word.Selection.PageSetup.LeftMargin = $marginLeft
    $word.Selection.PageSetup.RightMargin = $marginRight

    $word.Selection.Paste() # Now we have our new page on a new doc
    [Void]$word.Selection.EndKey(6, 0) # Move to the end of the file (6 = wdStory, 0 = wdMove)
    $word.Selection.TypeBackspace() # Seems to grab an extra section/page break
    [Void]$word.Selection.Delete() # Now we have our doc down to size
}

I will not go into detail, but the code is fairly readable. I found this piece of code and adapted it a little to my needs. I admit that I don't necessarily understand everything it does (but it works :-)).

We will look at saving the documents a little later in the article.
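The page arithmetic of the loop can be checked without opening Word at all. Here is a small sketch, assuming a helper I have named Get-PageChunk (hypothetical, not part of the original script), which lists the first and last page of each chunk for a given page count and chunk size:

```powershell
# Hypothetical helper: lists the page ranges that the splitting loop walks through.
function Get-PageChunk {
    param (
        [Parameter(Mandatory = $true)][int]$Pages,
        [ValidateRange(1, [int]::MaxValue)][int]$PageLength = 1
    )

    for ($i = 1; $i -le $Pages; $i += $PageLength) {
        # The last chunk is shorter when $Pages is not a multiple of $PageLength.
        $last = [Math]::Min($i + $PageLength - 1, $Pages)
        [pscustomobject]@{ FirstPage = $i; LastPage = $last }
    }
}

# A 5-page document cut into 2-page chunks gives pages 1-2, 3-4 and 5-5.
Get-PageChunk -Pages 5 -PageLength 2
```

The final chunk is the one to watch: it has no "next page" to jump to, so its range has to end at the end of the document.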

Retrieve the name of the file in the Word document

Each new Word document corresponds to a letter for one recipient. In the case of a letter to tenants, the unique identifier that I have to retrieve from the Word document is the lease number, which is a series of 10 digits always starting with 0.

To find this number I used a regular expression.

$FileNamePattern = ".*de bail.*(0\d{9})"
$regex = [Regex]::Match($rngPage.Text, $fileNamePattern)

if ($regex.Success) {
    $id = $regex.Groups[1].Value
}
else {
    $id = "patternNotFound" + $i
}

I define my pattern, which searches anywhere in the text for the string "de bail" (French for "lease"), followed by any characters, followed by a group of 10 digits: a 0 and then 9 more digits.

I apply my pattern to the text of the current page and get the result in the $regex variable.

If all goes well, I get the lease number in the $id variable, which I will use to build the document name. Otherwise, the name falls back to "patternNotFound" followed by the page number.
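The pattern can be tried on its own before running it against real letters. Here is a quick sketch, using two made-up sample strings (the wording of the real letters is not shown in this article, so these texts are assumptions):

```powershell
$fileNamePattern = '.*de bail.*(0\d{9})'

# Made-up page text. Word separates paragraphs with carriage returns (`r),
# which the regex "." matches, so ".*" can span several paragraphs.
$sampleText = "Dear tenant,`rNumero de bail: 0123456789`rRegards"
$match = [Regex]::Match($sampleText, $fileNamePattern)
$id = if ($match.Success) { $match.Groups[1].Value } else { 'patternNotFound' }

# Both ".*" are greedy, so with two candidates the last one is captured.
$twoNumbers = 'Numero de bail: 0111111111, ancien numero: 0222222222'
$lastId = [Regex]::Match($twoNumbers, $fileNamePattern).Groups[1].Value

$id      # 0123456789
$lastId  # 0222222222
```

If a letter can contain several 10-digit numbers starting with 0, a tighter pattern (for example a lazy ".*?" before the group) may be a better fit, depending on which number is the lease number.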

Save each document as PDF

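$path = $outputPath + $id + ".pdf"
$newDoc.saveas([ref] $path, 17) # 17 = wdFormatPDF
$newDoc.close([ref]$False) # Close the temporary document without saving it as a Word file

$outputPath is a variable that contains the path to the destination directory for my PDF files. Because the file name is built by simple concatenation, this path must end with a backslash.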
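The concatenation above depends on how the directory path is written, and on the extracted text being valid as a file name. Here is a sketch of a slightly more defensive variant, assuming a helper I have named Get-PdfPath (hypothetical, not part of the original script):

```powershell
# Hypothetical helper: builds the full path of the PDF and replaces
# characters that are not allowed in file names with an underscore.
function Get-PdfPath {
    param (
        [Parameter(Mandatory = $true)][string]$OutputPath,
        [Parameter(Mandatory = $true)][string]$Id
    )

    $safeId = $Id
    foreach ($c in [System.IO.Path]::GetInvalidFileNameChars()) {
        $safeId = $safeId.Replace([string]$c, '_')
    }

    # Join-Path inserts the path separator only when it is missing.
    Join-Path -Path $OutputPath -ChildPath ($safeId + '.pdf')
}

Get-PdfPath -OutputPath ([System.IO.Path]::GetTempPath()) -Id '0123456789'
```

With a lease number made only of digits the clean-up step changes nothing, but it protects the save step if the pattern is later changed to capture free text.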

Here is the entire function:

function Convert-Docx2Pdf {

    [CmdletBinding()]
    param (
        [Parameter(Mandatory = $False)][string]$FileNamePattern = ".*de bail.*(0\d{9})",
        [Parameter(Mandatory = $False)][int]$pageLength = 1,
        [Parameter(Mandatory = $true)][string]$InputFile,
        [Parameter(Mandatory = $False)][string]$outputPath = $env:temp + "\Outputdir\"
    )

    BEGIN {
        # Start from an empty output directory (any previous content is deleted)
        if (Test-Path $outputPath) {
            Remove-Item -Path $outputPath -Recurse -Force -Confirm:$false
        }
        [Void](New-Item -Path $outputPath -ItemType Directory -Force -Confirm:$false)
    }

    PROCESS {
        $word = New-Object -ComObject word.application
        $word.Visible = $False
        $doc = $null

        try {
            $doc = $word.Documents.Open($inputFile)
            $pages = $doc.ComputeStatistics([Microsoft.Office.Interop.Word.WdStatistic]::wdStatisticPages)
            $rngPage = $doc.Range()

            for ($i = 1; $i -le $pages; $i += $pageLength)
            {
                # Jump to the starting page and mark the beginning of the range
                [Void]$word.Selection.GoTo([Microsoft.Office.Interop.Word.WdGoToItem]::wdGoToPage,
                    [Microsoft.Office.Interop.Word.WdGoToDirection]::wdGoToAbsolute,
                    $i)
                $rngPage.Start = $word.Selection.Start

                if ($i + $pageLength -le $pages) {
                    # Jump to the next page number and mark the end of the range
                    [Void]$word.Selection.GoTo([Microsoft.Office.Interop.Word.WdGoToItem]::wdGoToPage,
                        [Microsoft.Office.Interop.Word.WdGoToDirection]::wdGoToAbsolute,
                        $i + $pageLength)
                    $rngPage.End = $word.Selection.Start
                }
                else {
                    # Last chunk: the range runs to the end of the document
                    $rngPage.End = $doc.Content.End
                }

                # Remember the margins of the source document
                $marginTop = $word.Selection.PageSetup.TopMargin
                $marginBottom = $word.Selection.PageSetup.BottomMargin
                $marginLeft = $word.Selection.PageSetup.LeftMargin
                $marginRight = $word.Selection.PageSetup.RightMargin

                $rngPage.Copy()
                $newDoc = $word.Documents.Add() # The new document becomes the active one

                $word.Selection.PageSetup.TopMargin = $marginTop
                $word.Selection.PageSetup.BottomMargin = $marginBottom
                $word.Selection.PageSetup.LeftMargin = $marginLeft
                $word.Selection.PageSetup.RightMargin = $marginRight

                $word.Selection.Paste() # Now we have our new page on a new doc
                [Void]$word.Selection.EndKey(6, 0) # Move to the end of the file (6 = wdStory, 0 = wdMove)
                $word.Selection.TypeBackspace() # Seems to grab an extra section/page break
                [Void]$word.Selection.Delete() # Now we have our doc down to size

                # Get the name from the text of the current page
                $regex = [Regex]::Match($rngPage.Text, $fileNamePattern)
                if ($regex.Success) {
                    $id = $regex.Groups[1].Value
                }
                else {
                    $id = "patternNotFound" + $i
                }

                $path = $outputPath + $id + ".pdf"
                $newDoc.saveas([ref] $path, 17) # 17 = wdFormatPDF
                $newDoc.close([ref]$False)
            }
        }
        finally {
            # Close the source document without saving and quit Word,
            # so that no WINWORD.EXE process is left running
            if ($doc) { $doc.Close([ref]$False) }
            $word.Quit()
        }
    }

    END {
        [gc]::collect()
        [gc]::WaitForPendingFinalizers()
    }
}

See also