
SharePoint Online: Find Duplicate Files Using PowerShell

Author: 精品下载站 | Date: 2024-12-14 15:31:13 | Category: 玩电脑



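Requirement: Find duplicate documents in SharePoint Online.

When people from different teams work together, duplicate content is likely to build up in SharePoint. The same document may have been uploaded to several libraries, or to different folders within a single library. Duplicate files take up valuable storage space and make it hard to tell which copy is the correct version. So how do you find duplicate documents in SharePoint Online? In this post, I will show you how to find duplicate files with PowerShell.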


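SharePoint Online: Find Duplicate Documents Using PowerShell - File Hash Method

Let's find duplicate files in a SharePoint Online document library by comparing file hashes. The script reads the content of each file, computes its MD5 hash, and reports files that share the same hash as duplicates: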


#Load SharePoint CSOM Assemblies
Add-Type -Path "C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\ISAPI\Microsoft.SharePoint.Client.dll"
Add-Type -Path "C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\ISAPI\Microsoft.SharePoint.Client.Runtime.dll"
 
#Parameters
$SiteURL = "https://Crescent.sharepoint.com/sites/marketing"
$ListName ="Branding"
 
#Array to collect result data
$DataCollection = @() 
 
#Get credentials to connect
$Cred = Get-Credential
 
Try {
    #Setup the Context
    $Ctx = New-Object Microsoft.SharePoint.Client.ClientContext($SiteURL)
    $Ctx.Credentials = New-Object Microsoft.SharePoint.Client.SharePointOnlineCredentials($Cred.UserName, $Cred.Password)
 
    #Get the Web and List
    $Web = $Ctx.Web
    $Ctx.Load($Web)
    $List = $Ctx.Web.Lists.GetByTitle($ListName)
    $Ctx.Load($List)
    $Ctx.ExecuteQuery()

    #Define Query to get List Items in batch
    $BatchSize = 2000
    $Query = New-Object Microsoft.SharePoint.Client.CamlQuery
    $Query.ViewXml = @"
    <View Scope='RecursiveAll'>
        <Query>
            <OrderBy><FieldRef Name='ID' Ascending='TRUE'/></OrderBy>
        </Query>
        <RowLimit Paged="TRUE">$BatchSize</RowLimit>
    </View>
"@

    #Get List Items in Batch
    $Count=1
    Do
    {
        $ListItems = $List.GetItems($Query)
        $Ctx.Load($ListItems)
        $Ctx.ExecuteQuery()
        
        #Process all items in the batch    
        ForEach($Item in $ListItems)
        {
            #Filter Files
            If($Item.FileSystemObjectType -eq "File")
            {
                #Get the File from Item
                $File = $Item.File
                $Ctx.Load($File)
                $Ctx.ExecuteQuery()
                Write-Progress -PercentComplete ($Count / $List.ItemCount * 100) -Activity "Processing File $count of $($List.ItemCount)" -Status "Scanning File '$($File.Name)'"
 
                #Get The File Hash
                $Bytes = $Item.file.OpenBinaryStream()
                $Ctx.ExecuteQuery()
                $MD5 = New-Object -TypeName System.Security.Cryptography.MD5CryptoServiceProvider 
                $HashCode = [System.BitConverter]::ToString($MD5.ComputeHash($Bytes.Value)) 
 
                #Collect data        
                $Data = New-Object PSObject 
                $Data | Add-Member -MemberType NoteProperty -name "File Name" -value $File.Name
                $Data | Add-Member -MemberType NoteProperty -Name "HashCode" -value $HashCode
                $Data | Add-Member -MemberType NoteProperty -Name "URL" -value $File.ServerRelativeUrl
                $DataCollection += $Data
            }
            $Count++
        }
        $Query.ListItemCollectionPosition = $ListItems.ListItemCollectionPosition
    }While($Query.ListItemCollectionPosition -ne $null)
    
    #Get Duplicate Files
    $Duplicates = $DataCollection | Group-Object -Property HashCode | Where {$_.Count -gt 1}  | Select -ExpandProperty Group
    If($Duplicates.Count -gt 1)
    {
        $Duplicates | Out-GridView
    }
    Else
    {
        Write-host -f Yellow "No Duplicates Found!"
    }
}
Catch {
    write-host -f Red "Error:" $_.Exception.Message
}
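
The duplicate-detection step at the end of the script can be tried without a SharePoint connection. The following is a minimal sketch that assumes a small in-memory inventory; the file names, URLs, and hash strings are hypothetical placeholders, not real MD5 values. It shows how Group-Object keeps only the hash values that occur more than once.

```powershell
# Hypothetical inventory rows, shaped like the $Data objects the script builds
$Inventory = @(
    [PSCustomObject]@{ FileName = "Logo.png";      HashCode = "AA-11"; URL = "/sites/marketing/Branding/Logo.png" }
    [PSCustomObject]@{ FileName = "Logo-copy.png"; HashCode = "AA-11"; URL = "/sites/marketing/Branding/Archive/Logo-copy.png" }
    [PSCustomObject]@{ FileName = "Banner.png";    HashCode = "BB-22"; URL = "/sites/marketing/Branding/Banner.png" }
)

# Group rows by hash; a group with more than one member is a set of duplicates
$DuplicateGroups = @($Inventory | Group-Object -Property HashCode | Where-Object { $_.Count -gt 1 })

# Flatten the groups back into individual rows for display or export
$Duplicates = @($DuplicateGroups | Select-Object -ExpandProperty Group)

$Duplicates | Format-Table -AutoSize
```

Wrapping the pipeline in @( ) keeps the result an array even when only one group matches, so the Count property behaves predictably.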

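However, this method does not work for Office documents such as .docx, .pptx, and .xlsx. SharePoint stores the metadata of Office documents inside the document itself, whereas for other file types the metadata is kept in the SharePoint content database. So when you upload the same Office document twice, the embedded metadata (such as "Created") differs between the two copies, and so do their hashes.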
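
To see why a small metadata difference defeats the hash comparison, here is a minimal sketch. The byte payloads are hypothetical stand-ins for file content (not real Office files), and Get-Md5Hex is a helper name introduced only for this illustration.

```powershell
# Hypothetical payloads: two uploads of the "same" file whose embedded
# creation date differs by one character, plus a true byte-for-byte copy
$Original  = [System.Text.Encoding]::UTF8.GetBytes("Quarterly report; Created=2024-01-01")
$Reupload  = [System.Text.Encoding]::UTF8.GetBytes("Quarterly report; Created=2024-01-02")
$ExactCopy = [System.Text.Encoding]::UTF8.GetBytes("Quarterly report; Created=2024-01-01")

# Illustrative helper: MD5 of a byte array, formatted like the scripts in this post
function Get-Md5Hex {
    param([byte[]]$Bytes)
    $MD5 = [System.Security.Cryptography.MD5]::Create()
    try { [System.BitConverter]::ToString($MD5.ComputeHash($Bytes)) }
    finally { $MD5.Dispose() }
}

$HashOriginal  = Get-Md5Hex -Bytes $Original
$HashReupload  = Get-Md5Hex -Bytes $Reupload
$HashExactCopy = Get-Md5Hex -Bytes $ExactCopy

Write-Host "Exact copy matches:        $($HashOriginal -eq $HashExactCopy)"
Write-Host "Changed metadata matches:  $($HashOriginal -eq $HashReupload)"
```

Only the byte-for-byte copy produces the same hash. For Office documents, comparing file name and file size may therefore be a more practical signal than the hash alone.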

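PowerShell to Find All Duplicate Files in a Site (Compare Hash, File Name, and File Size)

This PowerShell script scans every file in all document libraries of a site and collects the file name, file hash, and file size for comparison. It exports the full inventory to a CSV report and then lists potential duplicates by each of the three parameters.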


#Load SharePoint CSOM Assemblies
Add-Type -Path "C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\ISAPI\Microsoft.SharePoint.Client.dll"
Add-Type -Path "C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\ISAPI\Microsoft.SharePoint.Client.Runtime.dll"

#Parameters
$SiteURL = "https://Crescent.sharepoint.com/sites/marketing"
$CSVPath = "C:\Temp\Duplicates.csv"
$BatchSize = 2000
#Array for Result Data
$DataCollection = @()

#Get credentials to connect
$Cred = Get-Credential

Try {
    #Setup the Context
    $Ctx = New-Object Microsoft.SharePoint.Client.ClientContext($SiteURL)
    $Ctx.Credentials = New-Object Microsoft.SharePoint.Client.SharePointOnlineCredentials($Cred.UserName, $Cred.Password)

    #Get the Web
    $Web = $Ctx.Web
    $Lists = $Web.Lists
    $Ctx.Load($Web)
    $Ctx.Load($Lists)
    $Ctx.ExecuteQuery()

    #Iterate through Each List on the web
    ForEach($List in $Lists)
    {
        #Filter Lists
        If($List.BaseType -eq "DocumentLibrary" -and $List.Hidden -eq $False  -and $List.ItemCount -gt 0 -and $List.Title -Notin("Site Pages","Style Library", "Preservation Hold Library"))
        {
            #Define CAML Query to get Files from the list in batches
            $Query = New-Object Microsoft.SharePoint.Client.CamlQuery
            $Query.ViewXml = @"
                <View Scope='RecursiveAll'>
                    <Query>
                        <OrderBy><FieldRef Name='ID' Ascending='TRUE'/></OrderBy>
                    </Query>
                    <RowLimit Paged='TRUE'>$BatchSize</RowLimit>
                </View>
"@

            $Counter = 1
            #Get Files from the Library in Batches
            Do {
                $ListItems = $List.GetItems($Query)
                $Ctx.Load($ListItems)
                $Ctx.ExecuteQuery()

                ForEach($Item in $ListItems)
                {
                    #Filter Files
                    If($Item.FileSystemObjectType -eq "File")
                    {
                        #Get the File from Item
                        $File = $Item.File
                        $Ctx.Load($File)
                        $Ctx.ExecuteQuery()
                        Write-Progress -PercentComplete ($Counter / $List.ItemCount * 100) -Activity "Processing File $Counter of $($List.ItemCount) in $($List.Title) of $($Web.URL)" -Status "Scanning File '$($File.Name)'" 

                        #Get The File Hash
                        $Bytes = $File.OpenBinaryStream()
                        $Ctx.ExecuteQuery()
                        $MD5 = New-Object -TypeName System.Security.Cryptography.MD5CryptoServiceProvider 
                        $HashCode = [System.BitConverter]::ToString($MD5.ComputeHash($Bytes.Value)) 

                        #Collect data        
                        $Data = New-Object PSObject 
                        $Data | Add-Member -MemberType NoteProperty -name "FileName" -value $File.Name
                        $Data | Add-Member -MemberType NoteProperty -Name "HashCode" -value $HashCode
                        $Data | Add-Member -MemberType NoteProperty -Name "URL" -value $File.ServerRelativeUrl
                        $Data | Add-Member -MemberType NoteProperty -Name "FileSize" -value $File.Length        
                        $DataCollection += $Data
                    }
                    $Counter++
                }
                #Update the position of the ListItemCollectionPosition for the next batch
                $Query.ListItemCollectionPosition = $ListItems.ListItemCollectionPosition
            }While($Query.ListItemCollectionPosition -ne $null)
        }
    }
    #Export All Data to CSV
    $DataCollection | Export-Csv -Path $CSVPath -NoTypeInformation
    Write-host -f Green "Files Inventory has been Exported to $CSVPath"

    #Get Duplicate Files by Grouping Hash code
    $Duplicates = $DataCollection | Group-Object -Property HashCode | Where {$_.Count -gt 1}  | Select -ExpandProperty Group
    Write-host "Duplicate Files Based on File Hashcode:"
    $Duplicates | Format-table -AutoSize

    #Group Based on File Name
    $FileNameDuplicates = $DataCollection | Group-Object -Property FileName | Where {$_.Count -gt 1}  | Select -ExpandProperty Group
    Write-host "Potential Duplicate Based on File Name:"
    $FileNameDuplicates| Format-table -AutoSize

    #Group Based on File Size
    $FileSizeDuplicates = $DataCollection | Group-Object -Property FileSize | Where {$_.Count -gt 1}  | Select -ExpandProperty Group
    Write-host "Potential Duplicates Based on File Size:"
    $FileSizeDuplicates| Format-table -AutoSize
}
Catch {
    write-host -f Red "Error:" $_.Exception.Message
}

If you are cleaning up a SharePoint environment and want to free up some storage space, this can be a useful tool.
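
As a rough sketch of how the exported inventory could feed a cleanup estimate, the example below groups rows by hash and counts every copy beyond the first as reclaimable. The rows are hypothetical sample data with the same columns as the CSV report; in practice they might come from Import-Csv on the report file. The estimate assumes one copy of each file is kept and covers current file content only.

```powershell
# Hypothetical inventory rows with the same columns the script exports to CSV
$Inventory = @(
    [PSCustomObject]@{ FileName = "Plan.pdf"; HashCode = "AA-11"; URL = "/sites/marketing/Docs/Plan.pdf";     FileSize = 1000 }
    [PSCustomObject]@{ FileName = "Plan.pdf"; HashCode = "AA-11"; URL = "/sites/marketing/Shared/Plan.pdf";   FileSize = 1000 }
    [PSCustomObject]@{ FileName = "Plan.pdf"; HashCode = "AA-11"; URL = "/sites/marketing/Archive/Plan.pdf";  FileSize = 1000 }
    [PSCustomObject]@{ FileName = "Logo.png"; HashCode = "BB-22"; URL = "/sites/marketing/Branding/Logo.png"; FileSize = 500 }
)

# One summary row per duplicated hash: copies found and bytes freed if one copy is kept
$Summary = @($Inventory | Group-Object -Property HashCode | Where-Object { $_.Count -gt 1 } | ForEach-Object {
    # Cast to long because values read back from a CSV file are strings
    $Size = [long]$_.Group[0].FileSize
    [PSCustomObject]@{
        HashCode         = $_.Name
        Copies           = $_.Count
        ReclaimableBytes = ($_.Count - 1) * $Size
    }
})

$TotalReclaimable = ($Summary | Measure-Object -Property ReclaimableBytes -Sum).Sum
$Summary | Format-Table -AutoSize
Write-Host "Estimated reclaimable bytes: $TotalReclaimable"
```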

PnP PowerShell to Find Duplicate Files in a SharePoint Online Site

This time, let's use PnP PowerShell to scan all document libraries in a site for duplicate files and export the results to a CSV file!


#Parameters
$SiteURL = "https://Crescent.sharepoint.com/sites/Purchase"
$Pagesize = 2000
$ReportOutput = "C:\Temp\Duplicates.csv"

#Connect to SharePoint Online site
Connect-PnPOnline $SiteURL -Interactive
 
#Array to store results
$DataCollection = @()

#Get all Document libraries
$DocumentLibraries = Get-PnPList | Where-Object {$_.BaseType -eq "DocumentLibrary" -and $_.Hidden -eq $false -and $_.ItemCount -gt 0 -and $_.Title -Notin("Site Pages","Style Library", "Preservation Hold Library")}

#Iterate through each document library
ForEach($Library in $DocumentLibraries)
{    
    #Get All documents from the library
    $global:counter = 0;
    $Documents = Get-PnPListItem -List $Library -PageSize $Pagesize -Fields ID, File_x0020_Type -ScriptBlock `
        { Param($items) $global:counter += $items.Count; Write-Progress -PercentComplete ($global:Counter / ($Library.ItemCount) * 100) -Activity `
             "Getting Documents from Library '$($Library.Title)'" -Status "Getting Documents data $global:Counter of $($Library.ItemCount)";} | Where {$_.FileSystemObjectType -eq "File"}
  
    $ItemCounter = 0
    #Iterate through each document
    Foreach($Document in $Documents)
    {
        #Get the File from Item
        $File = Get-PnPProperty -ClientObject $Document -Property File

        #Get The File Hash
        $Bytes = $File.OpenBinaryStream()
        Invoke-PnPQuery
        $MD5 = New-Object -TypeName System.Security.Cryptography.MD5CryptoServiceProvider
        $HashCode = [System.BitConverter]::ToString($MD5.ComputeHash($Bytes.Value))
 
        #Collect data        
        $Data = New-Object PSObject 
        $Data | Add-Member -MemberType NoteProperty -name "FileName" -value $File.Name
        $Data | Add-Member -MemberType NoteProperty -Name "HashCode" -value $HashCode
        $Data | Add-Member -MemberType NoteProperty -Name "URL" -value $File.ServerRelativeUrl
        $Data | Add-Member -MemberType NoteProperty -Name "FileSize" -value $File.Length        
        $DataCollection += $Data
        $ItemCounter++
        Write-Progress -PercentComplete ($ItemCounter / ($Library.ItemCount) * 100) -Activity "Collecting data from Documents $ItemCounter of $($Library.ItemCount) from $($Library.Title)" `
                     -Status "Reading Data from Document '$($File.Name)' at '$($File.ServerRelativeUrl)'"
    }
}
#Get Duplicate Files by Grouping Hash code
$Duplicates = $DataCollection | Group-Object -Property HashCode | Where {$_.Count -gt 1}  | Select -ExpandProperty Group
Write-host "Duplicate Files Based on File Hashcode:"
$Duplicates | Format-table -AutoSize

#Export the duplicates results to CSV
$Duplicates | Export-Csv -Path $ReportOutput -NoTypeInformation

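In summary, you can find duplicate files in SharePoint Online with the PowerShell scripts shown above. Note that you need permission to access the site and its files before you start, and that the scan can take a long time on sites with many files, because the content of every file is read to compute its hash.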
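
One possible way to shorten the scan, offered here as an assumption rather than as part of the scripts above, is to hash only files that share their size with at least one other file, since two files of different sizes cannot be byte-identical. The sketch below uses hypothetical rows; FileSize stands in for the file length, which the scripts read before they open the file content.

```powershell
# Hypothetical file list; FileSize stands in for the length reported by SharePoint
$Files = @(
    [PSCustomObject]@{ URL = "/sites/Purchase/Docs/Invoice-01.pdf"; FileSize = 20480 }
    [PSCustomObject]@{ URL = "/sites/Purchase/Docs/Invoice-02.pdf"; FileSize = 20480 }
    [PSCustomObject]@{ URL = "/sites/Purchase/Docs/Contract.pdf";   FileSize = 98304 }
)

# Only files whose size occurs more than once can be exact duplicates,
# so only these would need to be downloaded and hashed
$HashCandidates = @($Files | Group-Object -Property FileSize |
    Where-Object { $_.Count -gt 1 } |
    Select-Object -ExpandProperty Group)

$Skipped = $Files.Count - $HashCandidates.Count
Write-Host "Files to hash: $($HashCandidates.Count) of $($Files.Count) (skipped: $Skipped)"
```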
