[System Tips] SharePoint Online: Find Duplicate Files Using PowerShell
Author: 精品下载站 | Date: 2024-12-14 15:31:13 | Category: Computing
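Requirement: Find duplicate documents in SharePoint Online.
When many people from different teams work together, duplicate content is likely to appear in SharePoint. People may upload the same document to several libraries, or even to different folders within one library. Duplicate files consume valuable storage space and make it hard to find the correct version of a file. This post shows how to find duplicate files in SharePoint Online.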
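SharePoint Online: Find Duplicate Documents Using PowerShell - File Hash Method
How do you find duplicate files in SharePoint Online? Let's find the duplicate files in a SharePoint Online document library by comparing file hashes: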
#Load SharePoint CSOM Assemblies (adjust the path to match where the SharePoint Online Client Components SDK is installed)
Add-Type -Path "C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\16\ISAPI\Microsoft.SharePoint.Client.dll"
Add-Type -Path "C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\16\ISAPI\Microsoft.SharePoint.Client.Runtime.dll"

#Parameters
$SiteURL = "https://Crescent.sharepoint.com/sites/marketing"
$ListName = "Branding"

#Array to collect the result data
$DataCollection = @()

#Get credentials to connect
$Cred = Get-Credential

Try {
    #Setup the Context
    $Ctx = New-Object Microsoft.SharePoint.Client.ClientContext($SiteURL)
    $Ctx.Credentials = New-Object Microsoft.SharePoint.Client.SharePointOnlineCredentials($Cred.UserName, $Cred.Password)

    #Get the Web and List
    $Web = $Ctx.Web
    $Ctx.Load($Web)
    $List = $Ctx.Web.Lists.GetByTitle($ListName)
    $Ctx.Load($List)
    $Ctx.ExecuteQuery()

    #Define Query to get List Items in batches
    $BatchSize = 2000
    $Query = New-Object Microsoft.SharePoint.Client.CamlQuery
    $Query.ViewXml = @"
<View Scope='RecursiveAll'>
    <Query>
        <OrderBy><FieldRef Name='ID' Ascending='TRUE'/></OrderBy>
    </Query>
    <RowLimit Paged="TRUE">$BatchSize</RowLimit>
</View>
"@

    #Get List Items in batches
    $Count = 1
    Do
    {
        $ListItems = $List.GetItems($Query)
        $Ctx.Load($ListItems)
        $Ctx.ExecuteQuery()

        #Process all items in the batch
        ForEach($Item in $ListItems)
        {
            #Filter Files (skip folders)
            If($Item.FileSystemObjectType -eq "File")
            {
                #Get the File from Item
                $File = $Item.File
                $Ctx.Load($File)
                $Ctx.ExecuteQuery()
                Write-Progress -PercentComplete ($Count / $List.ItemCount * 100) -Activity "Processing File $Count of $($List.ItemCount)" -Status "Scanning File '$($File.Name)'"

                #Get the File Hash (downloads the file content as a stream)
                $Bytes = $File.OpenBinaryStream()
                $Ctx.ExecuteQuery()
                $MD5 = New-Object -TypeName System.Security.Cryptography.MD5CryptoServiceProvider
                $HashCode = [System.BitConverter]::ToString($MD5.ComputeHash($Bytes.Value))

                #Collect data
                $Data = New-Object PSObject
                $Data | Add-Member -MemberType NoteProperty -Name "File Name" -Value $File.Name
                $Data | Add-Member -MemberType NoteProperty -Name "HashCode" -Value $HashCode
                $Data | Add-Member -MemberType NoteProperty -Name "URL" -Value $File.ServerRelativeUrl
                $DataCollection += $Data
            }
            $Count++
        }
        #Move to the next batch
        $Query.ListItemCollectionPosition = $ListItems.ListItemCollectionPosition
    }While($Query.ListItemCollectionPosition -ne $null)

    #Get Duplicate Files: any hash that occurs more than once
    $Duplicates = $DataCollection | Group-Object -Property HashCode | Where-Object {$_.Count -gt 1} | Select-Object -ExpandProperty Group
    If($Duplicates.Count -gt 1)
    {
        $Duplicates | Out-GridView
    }
    Else
    {
        Write-Host -f Yellow "No Duplicates Found!"
    }
}
Catch {
    Write-Host -f Red "Error:" $_.Exception.Message
}
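However, this method does not work for Office documents such as .docx, .pptx, and .xlsx. SharePoint stores the metadata of Office documents inside the file itself, whereas for other file types the metadata is stored in the SharePoint content database. As a result, when you upload the same Office document twice, the embedded metadata (such as "Created") differs between the two copies, and so do their hashes.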
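The limitation above can be reproduced without SharePoint. The following is a minimal sketch on made-up data: `Get-BytesHash`, the sample URLs, and the `Created=...;Body=...` strings are hypothetical stand-ins for real file content with embedded metadata. It suggests that two copies whose embedded metadata differs get different hashes, while grouping by file name and size still flags them as candidates.

```powershell
# Hypothetical helper: MD5 hash of a byte array, formatted like the scripts above.
function Get-BytesHash {
    param([Parameter(Mandatory)][byte[]]$Bytes)
    $MD5 = [System.Security.Cryptography.MD5]::Create()
    try {
        [System.BitConverter]::ToString($MD5.ComputeHash($Bytes))
    }
    finally {
        $MD5.Dispose()
    }
}

# Made-up sample data: the same document body uploaded twice, with a
# different embedded "Created" value, plus one unrelated file.
$Uploads = @(
    @{ Url = "/sites/marketing/Branding/Plan.docx";         Content = "Created=2024-01-01;Body=Quarterly plan" }
    @{ Url = "/sites/marketing/Shared Documents/Plan.docx"; Content = "Created=2024-01-02;Body=Quarterly plan" }
    @{ Url = "/sites/marketing/Branding/Logo.png";          Content = "PNG-bytes" }
)

# Build inventory rows with the same columns as the report scripts.
$Encoding = [System.Text.Encoding]::UTF8
$Inventory = foreach ($Upload in $Uploads) {
    $Bytes = $Encoding.GetBytes($Upload.Content)
    [pscustomobject]@{
        FileName = $Upload.Url.Split('/')[-1]
        HashCode = Get-BytesHash -Bytes $Bytes
        URL      = $Upload.Url
        FileSize = $Bytes.Length
    }
}

# Hash grouping misses the pair because one metadata byte differs.
$HashDuplicates = @($Inventory | Group-Object -Property HashCode |
    Where-Object { $_.Count -gt 1 } | Select-Object -ExpandProperty Group)

# Name + size grouping still reports both copies as candidates.
$NameSizeCandidates = @($Inventory | Group-Object -Property FileName, FileSize |
    Where-Object { $_.Count -gt 1 } | Select-Object -ExpandProperty Group)

"Duplicates by hash: $($HashDuplicates.Count)"
"Candidates by name and size: $($NameSizeCandidates.Count)"
```

Name and size matching is only a heuristic: it can flag unrelated files that happen to share both values, so the candidates still need a manual check.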
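PowerShell to Find All Duplicate Files in a Site (Compare Hash, File Name, and File Size)
This PowerShell script scans all files in all document libraries of a site and collects the file name, file hash, and file size of each. It exports the full inventory to a CSV report and then lists the files that share a hash, a name, or a size.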
#Load SharePoint CSOM Assemblies (adjust the path to match where the SharePoint Online Client Components SDK is installed)
Add-Type -Path "C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\16\ISAPI\Microsoft.SharePoint.Client.dll"
Add-Type -Path "C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\16\ISAPI\Microsoft.SharePoint.Client.Runtime.dll"

#Parameters
$SiteURL = "https://Crescent.sharepoint.com/sites/marketing"
$CSVPath = "C:\Temp\Duplicates.csv"
$BatchSize = 2000

#Array for Result Data
$DataCollection = @()

#Get credentials to connect
$Cred = Get-Credential

Try {
    #Setup the Context
    $Ctx = New-Object Microsoft.SharePoint.Client.ClientContext($SiteURL)
    $Ctx.Credentials = New-Object Microsoft.SharePoint.Client.SharePointOnlineCredentials($Cred.UserName, $Cred.Password)

    #Get the Web
    $Web = $Ctx.Web
    $Lists = $Web.Lists
    $Ctx.Load($Web)
    $Ctx.Load($Lists)
    $Ctx.ExecuteQuery()

    #Iterate through each List on the web
    ForEach($List in $Lists)
    {
        #Filter Lists: visible, non-empty document libraries, excluding system libraries
        If($List.BaseType -eq "DocumentLibrary" -and $List.Hidden -eq $False -and $List.ItemCount -gt 0 -and $List.Title -Notin("Site Pages","Style Library", "Preservation Hold Library"))
        {
            #Define CAML Query to get Files from the list in batches
            $Query = New-Object Microsoft.SharePoint.Client.CamlQuery
            $Query.ViewXml = @"
<View Scope='RecursiveAll'>
    <Query>
        <OrderBy><FieldRef Name='ID' Ascending='TRUE'/></OrderBy>
    </Query>
    <RowLimit Paged='TRUE'>$BatchSize</RowLimit>
</View>
"@
            $Counter = 1

            #Get Files from the Library in Batches
            Do {
                $ListItems = $List.GetItems($Query)
                $Ctx.Load($ListItems)
                $Ctx.ExecuteQuery()

                ForEach($Item in $ListItems)
                {
                    #Filter Files (skip folders)
                    If($Item.FileSystemObjectType -eq "File")
                    {
                        #Get the File from Item
                        $File = $Item.File
                        $Ctx.Load($File)
                        $Ctx.ExecuteQuery()
                        Write-Progress -PercentComplete ($Counter / $List.ItemCount * 100) -Activity "Processing File $Counter of $($List.ItemCount) in $($List.Title) of $($Web.URL)" -Status "Scanning File '$($File.Name)'"

                        #Get the File Hash (downloads the file content as a stream)
                        $Bytes = $File.OpenBinaryStream()
                        $Ctx.ExecuteQuery()
                        $MD5 = New-Object -TypeName System.Security.Cryptography.MD5CryptoServiceProvider
                        $HashCode = [System.BitConverter]::ToString($MD5.ComputeHash($Bytes.Value))

                        #Collect data
                        $Data = New-Object PSObject
                        $Data | Add-Member -MemberType NoteProperty -Name "FileName" -Value $File.Name
                        $Data | Add-Member -MemberType NoteProperty -Name "HashCode" -Value $HashCode
                        $Data | Add-Member -MemberType NoteProperty -Name "URL" -Value $File.ServerRelativeUrl
                        $Data | Add-Member -MemberType NoteProperty -Name "FileSize" -Value $File.Length
                        $DataCollection += $Data
                    }
                    $Counter++
                }
                #Update the position of the ListItemCollectionPosition to move to the next batch
                $Query.ListItemCollectionPosition = $ListItems.ListItemCollectionPosition
            }While($Query.ListItemCollectionPosition -ne $null)
        }
    }

    #Export All Data to CSV
    $DataCollection | Export-Csv -Path $CSVPath -NoTypeInformation
    Write-Host -f Green "Files Inventory has been Exported to $CSVPath"

    #Get Duplicate Files by Grouping Hash code
    $Duplicates = $DataCollection | Group-Object -Property HashCode | Where-Object {$_.Count -gt 1} | Select-Object -ExpandProperty Group
    Write-Host "Duplicate Files Based on File Hashcode:"
    $Duplicates | Format-Table -AutoSize

    #Group Based on File Name
    $FileNameDuplicates = $DataCollection | Group-Object -Property FileName | Where-Object {$_.Count -gt 1} | Select-Object -ExpandProperty Group
    Write-Host "Potential Duplicates Based on File Name:"
    $FileNameDuplicates | Format-Table -AutoSize

    #Group Based on File Size
    $FileSizeDuplicates = $DataCollection | Group-Object -Property FileSize | Where-Object {$_.Count -gt 1} | Select-Object -ExpandProperty Group
    Write-Host "Potential Duplicates Based on File Size:"
    $FileSizeDuplicates | Format-Table -AutoSize
}
Catch {
    Write-Host -f Red "Error:" $_.Exception.Message
}
This can be a useful tool if you are trying to clean up your SharePoint environment and free up some storage space.
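To judge how much space a cleanup might free, the inventory can be summarized per hash group. The sketch below is an illustration rather than part of the original scripts: the rows are made-up sample data with the same columns as the report (FileName, HashCode, URL, FileSize), and `ReclaimableBytes` is a hypothetical column name. It assumes that one copy of each duplicate group is kept.

```powershell
# Made-up rows shaped like the inventory report. To use real data, the rows
# could instead be read with Import-Csv from the exported CSV file.
$Inventory = @(
    [pscustomobject]@{ FileName = "Logo.png";      HashCode = "AA-11"; URL = "/sites/marketing/Branding/Logo.png";              FileSize = 5000 }
    [pscustomobject]@{ FileName = "Logo-copy.png"; HashCode = "AA-11"; URL = "/sites/marketing/Shared Documents/Logo-copy.png"; FileSize = 5000 }
    [pscustomobject]@{ FileName = "Logo.png";      HashCode = "AA-11"; URL = "/sites/marketing/Archive/Logo.png";               FileSize = 5000 }
    [pscustomobject]@{ FileName = "Guide.pdf";     HashCode = "BB-22"; URL = "/sites/marketing/Branding/Guide.pdf";             FileSize = 12000 }
)

# One summary row per hash that occurs more than once.
$Summary = $Inventory |
    Group-Object -Property HashCode |
    Where-Object { $_.Count -gt 1 } |
    ForEach-Object {
        # The cast also handles CSV input, where FileSize arrives as a string.
        $Size = [long]$_.Group[0].FileSize
        [pscustomobject]@{
            HashCode         = $_.Name
            Copies           = $_.Count
            ReclaimableBytes = ($_.Count - 1) * $Size   # keep one copy, remove the rest
            Locations        = ($_.Group.URL -join "; ")
        }
    }

$TotalReclaimable = ($Summary | Measure-Object -Property ReclaimableBytes -Sum).Sum

$Summary | Format-Table -AutoSize
"Estimated reclaimable bytes: $TotalReclaimable"
```

The figure is only a rough estimate: it counts current file sizes and ignores version history, which also consumes storage in SharePoint Online.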
PnP PowerShell to Find Duplicate Files in a SharePoint Online Site
This time, let's use PnP PowerShell to scan all document libraries in a site for duplicate files and export the results to a CSV file:
#Parameters
$SiteURL = "https://Crescent.sharepoint.com/sites/Purchase"
$Pagesize = 2000
$ReportOutput = "C:\Temp\Duplicates.csv"

#Connect to SharePoint Online site
#Note: recent PnP.PowerShell versions may also require -ClientId with your own app registration
Connect-PnPOnline -Url $SiteURL -Interactive

#Array to store results
$DataCollection = @()

#Get all Document libraries: visible, non-empty, excluding system libraries
$DocumentLibraries = Get-PnPList | Where-Object {$_.BaseType -eq "DocumentLibrary" -and $_.Hidden -eq $false -and $_.ItemCount -gt 0 -and $_.Title -Notin("Site Pages","Style Library", "Preservation Hold Library")}

#Iterate through each document library
ForEach($Library in $DocumentLibraries)
{
    #Get All documents from the library
    $global:counter = 0
    $Documents = Get-PnPListItem -List $Library -PageSize $Pagesize -Fields ID, File_x0020_Type -ScriptBlock {
        Param($items)
        $global:counter += $items.Count
        Write-Progress -PercentComplete ($global:counter / ($Library.ItemCount) * 100) -Activity "Getting Documents from Library '$($Library.Title)'" -Status "Getting Documents data $global:counter of $($Library.ItemCount)"
    } | Where-Object {$_.FileSystemObjectType -eq "File"}

    $ItemCounter = 0
    #Iterate through each document
    ForEach($Document in $Documents)
    {
        #Get the File from Item
        $File = Get-PnPProperty -ClientObject $Document -Property File

        #Get the File Hash (downloads the file content as a stream)
        $Bytes = $File.OpenBinaryStream()
        Invoke-PnPQuery
        $MD5 = New-Object -TypeName System.Security.Cryptography.MD5CryptoServiceProvider
        $HashCode = [System.BitConverter]::ToString($MD5.ComputeHash($Bytes.Value))

        #Collect data
        $Data = New-Object PSObject
        $Data | Add-Member -MemberType NoteProperty -Name "FileName" -Value $File.Name
        $Data | Add-Member -MemberType NoteProperty -Name "HashCode" -Value $HashCode
        $Data | Add-Member -MemberType NoteProperty -Name "URL" -Value $File.ServerRelativeUrl
        $Data | Add-Member -MemberType NoteProperty -Name "FileSize" -Value $File.Length
        $DataCollection += $Data

        $ItemCounter++
        Write-Progress -PercentComplete ($ItemCounter / ($Library.ItemCount) * 100) -Activity "Collecting data from Documents $ItemCounter of $($Library.ItemCount) from $($Library.Title)" -Status "Reading Data from Document '$($File.Name)' at '$($File.ServerRelativeUrl)'"
    }
}

#Get Duplicate Files by Grouping Hash code
$Duplicates = $DataCollection | Group-Object -Property HashCode | Where-Object {$_.Count -gt 1} | Select-Object -ExpandProperty Group
Write-Host "Duplicate Files Based on File Hashcode:"
$Duplicates | Format-Table -AutoSize

#Export the duplicates results to CSV
$Duplicates | Export-Csv -Path $ReportOutput -NoTypeInformation
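In summary, the PowerShell scripts above can find duplicate files in SharePoint Online. Note that you need permission to access the site and its files before you start, and that the scan can take a long time on sites with many files, because every file is downloaded to compute its hash.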
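On large sites, the result collection itself can add to the run time: appending to an array with `+=` copies the whole array on every addition. One possible adjustment, shown here as a sketch on generated sample rows rather than as a change to the scripts above, is to collect rows in a generic list and build each row with `[pscustomobject]`. The file names, hash values, and URLs below are made up for illustration.

```powershell
# A generic list grows in place instead of being copied on each addition.
$DataCollection = [System.Collections.Generic.List[object]]::new()

# Generate 1000 made-up rows; the modulo makes rows 1-100 and 901-1000
# share hash values so that the grouping step has duplicates to find.
foreach ($Index in 1..1000) {
    $DataCollection.Add([pscustomobject]@{
        FileName = "File$Index.txt"
        HashCode = "HASH-$($Index % 900)"
        URL      = "/sites/marketing/Branding/File$Index.txt"
        FileSize = 1024
    })
}

# The duplicate detection is the same grouping used by the scripts.
$Duplicates = @($DataCollection |
    Group-Object -Property HashCode |
    Where-Object { $_.Count -gt 1 } |
    Select-Object -ExpandProperty Group)

"Rows collected: $($DataCollection.Count)"
"Rows in duplicate groups: $($Duplicates.Count)"
```

In a real scan the download of each file is likely to dominate the run time, so this change mainly helps when the inventory grows to many thousands of rows.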