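The Manifest Committer for Azure and Google Cloud Storage

This document describes how to use the Manifest Committer.

The Manifest committer is a committer for work which provides performance on ABFS for "real world" queries, and performance and correctness on GCS. It also works with other filesystems, including HDFS. However, the design is optimized for object stores where listing operations are slow and expensive.

The architecture and implementation of the committer are covered in Manifest Committer Architecture.

The protocol and its correctness are covered in Manifest Committer Protocol.

The committer was added in March 2022 and should be considered unstable in early releases.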

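Problem
Solution
Binding to the manifest committer in Spark
- Using the Cloudstore committerinfo command to probe committer bindings
Verifying that the committer was used
Scaling jobs mapreduce.manifest.committer.io.threads
- mapreduce.manifest.committer.writer.queue.capacity
Optional: deleting target files in job commit
Viewing _SUCCESS files through the ManifestPrinter tool
Collecting job summaries mapreduce.manifest.committer.summary.report.directory
Full set of ABFS options for Spark
Experimental: ABFS rename rate limiting fs.azure.io.rate.limit
Advanced configuration options
- Validating output mapreduce.manifest.committer.validate.output
- Controlling storage integration mapreduce.manifest.committer.store.operations.classname
Support for concurrent jobs to the same directory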

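Problem

The only committer of work from Spark to Azure ADLS Gen 2 "abfs://" storage which is safe to use is the "v1 file committer".

It is "correct" in that if a task attempt fails, its output is guaranteed not to be included in the final output. The "v2" commit algorithm cannot meet that guarantee, which is why it is no longer the default.

But it is slow, especially on jobs which write deep directory trees of output. Why is it slow? It is hard to point at a particular cause, primarily because FileOutputCommitter lacks any instrumentation. Stack traces of running jobs generally show rename(), though list operations do surface too.

On Google GCS, neither the v1 nor the v2 algorithm is safe, because the Google filesystem does not have the atomic directory rename which the v1 algorithm requires.

A further issue is that both Azure and GCS storage may encounter scale issues when deleting directories with many descendants. This can trigger timeouts, because FileOutputCommitter assumes that cleaning up after the job is a fast call to delete("_temporary", true).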

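Solution

The Intermediate Manifest committer is a new committer for work which should deliver performance on ABFS for "real world" queries, and performance and correctness on GCS.

This committer uses the extension point which was added for the S3A committers. Users can declare a new committer factory for abfs:// and gs:// URLs. A suitably configured Spark deployment will pick up the new committer.

Directory performance issues in job cleanup can be addressed by two options:

1. The committer parallelizes deletion of task attempt directories before deleting the _temporary directory.
2. Cleanup can be disabled.

The committer can be used with any filesystem client which has a "real" file rename() operation. It has been optimized for remote object stores where listing and file probes are expensive. The design is less likely to offer such a significant speedup on HDFS, though the parallel renaming operations will still speed up jobs there compared to the classic v1 algorithm.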

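How it works

The full details are covered in Manifest Committer Architecture.

Using the committer

The hooks put in to support the S3A committers were designed to allow every filesystem schema to provide its own committer. See Switching to an S3A Committer.

A factory for the abfs schema is defined in mapreduce.outputcommitter.factory.scheme.abfs, and a similar one for the gs schema in mapreduce.outputcommitter.factory.scheme.gs.

Some matching Spark configuration changes, especially for the Parquet binding, will be required. The factory bindings can be set in core-site.xml if they are not already defined in the mapred-default.xml shipped in the JAR.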

<property>
  <name>mapreduce.outputcommitter.factory.scheme.abfs</name>
  <value>org.apache.hadoop.fs.azurebfs.commit.AzureManifestCommitterFactory</value>
</property>
<property>
  <name>mapreduce.outputcommitter.factory.scheme.gs</name>
  <value>org.apache.hadoop.mapreduce.lib.output.committer.manifest.ManifestCommitterFactory</value>
</property>

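Binding to the manifest committer in Spark

In Apache Spark, the configuration can be done either with command line options (after --conf) or by using the spark-defaults.conf file. The following is an example of using spark-defaults.conf; it also includes the configuration for Parquet, with a subclass of the Parquet committer which uses the factory mechanism internally.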

spark.hadoop.mapreduce.outputcommitter.factory.scheme.abfs org.apache.hadoop.fs.azurebfs.commit.AzureManifestCommitterFactory
spark.hadoop.mapreduce.outputcommitter.factory.scheme.gs org.apache.hadoop.mapreduce.lib.output.committer.manifest.ManifestCommitterFactory
spark.sql.parquet.output.committer.class org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
spark.sql.sources.commitProtocolClass org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
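
The same settings can instead be passed on the command line. The following is a minimal sketch which turns the settings above into a list of --conf arguments; the helper name to_conf_args and the dictionary name are hypothetical, and only the keys and values come from the example above.

```python
# Settings copied from the spark-defaults.conf example above.
MANIFEST_COMMITTER_SETTINGS = {
    "spark.hadoop.mapreduce.outputcommitter.factory.scheme.abfs":
        "org.apache.hadoop.fs.azurebfs.commit.AzureManifestCommitterFactory",
    "spark.hadoop.mapreduce.outputcommitter.factory.scheme.gs":
        "org.apache.hadoop.mapreduce.lib.output.committer.manifest.ManifestCommitterFactory",
    "spark.sql.parquet.output.committer.class":
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
}


def to_conf_args(settings):
    """Return a flat argument list: '--conf', 'key=value' for each setting."""
    args = []
    for key, value in settings.items():
        args.extend(["--conf", f"{key}={value}"])
    return args
```

The resulting list can be appended to a spark-submit or spark-shell command line, for example through subprocess.run.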

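Using the Cloudstore `committerinfo` command to probe committer bindings

The Hadoop committer settings can be validated with a recent build of cloudstore and its committerinfo command. This command instantiates a committer for the given path through the same factory mechanism that MR and Spark jobs use, then prints its toString value.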

hadoop jar cloudstore-1.0.jar committerinfo abfs://testing@ukwest.dfs.core.windows.net/

2021-09-16 19:42:59,731 [main] INFO  commands.CommitterInfo (StoreDurationInfo.java:<init>(53)) - Starting: Create committer
Committer factory for path abfs://testing@ukwest.dfs.core.windows.net/ is
 org.apache.hadoop.mapreduce.lib.output.committer.manifest.ManifestCommitterFactory@3315d2d7
  (classname org.apache.hadoop.mapreduce.lib.output.committer.manifest.ManifestCommitterFactory)
2021-09-16 19:43:00,897 [main] INFO  manifest.ManifestCommitter (ManifestCommitter.java:<init>(144)) - Created ManifestCommitter with
   JobID job__0000, Task Attempt attempt__0000_r_000000_1 and destination abfs://testing@ukwest.dfs.core.windows.net/
Created committer of class org.apache.hadoop.mapreduce.lib.output.committer.manifest.ManifestCommitter:
 ManifestCommitter{ManifestCommitterConfig{destinationDir=abfs://testing@ukwest.dfs.core.windows.net/,
   role='task committer',
   taskAttemptDir=abfs://testing@ukwest.dfs.core.windows.net/_temporary/manifest_job__0000/0/_temporary/attempt__0000_r_000000_1,
   createJobMarker=true,
   jobUniqueId='job__0000',
   jobUniqueIdSource='JobID',
   jobAttemptNumber=0,
   jobAttemptId='job__0000_0',
   taskId='task__0000_r_000000',
   taskAttemptId='attempt__0000_r_000000_1'},
   iostatistics=counters=();

gauges=();

minimums=();

maximums=();

means=();
}

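Verifying that the committer was used

The new committer writes a JSON summary of the operation, including statistics, in the _SUCCESS file.

If this file exists and is zero bytes long, the classic FileOutputCommitter was used.

If this file exists and is greater than zero bytes long, either the manifest committer was used or, in the case of S3A filesystems, one of the S3A committers. They all use the same JSON format.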
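
The check above can be scripted. This is a minimal sketch, assuming the _SUCCESS file has already been copied to the local filesystem; the function name and its return labels are hypothetical, and it only tests the size and whether the content parses as JSON.

```python
import json
import os


def describe_success_marker(path):
    """Classify a local copy of a _SUCCESS file by its size and content."""
    if not os.path.exists(path):
        return "missing"
    if os.path.getsize(path) == 0:
        # Zero bytes: the classic FileOutputCommitter wrote the marker.
        return "classic"
    with open(path, "r", encoding="utf-8") as f:
        try:
            json.load(f)
        except ValueError:
            # Non-empty but not JSON: not a summary this sketch recognizes.
            return "unknown"
    # JSON summary: the manifest committer or one of the S3A committers.
    return "json-summary"
```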

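Configuration options

Here are the main configuration options of the committer.

Option	Meaning	Default Value
`mapreduce.manifest.committer.delete.target.files`	Delete target files?	`false`
`mapreduce.manifest.committer.io.threads`	Thread count for parallel operations	`64`
`mapreduce.manifest.committer.summary.report.directory`	Directory to save reports.	`""`
`mapreduce.manifest.committer.cleanup.parallel.delete`	Delete temporary directories in parallel	`true`
`mapreduce.fileoutputcommitter.cleanup.skipped`	Skip cleanup of the `_temporary` directory	`false`
`mapreduce.fileoutputcommitter.cleanup-failures.ignored`	Ignore errors during cleanup	`false`
`mapreduce.fileoutputcommitter.marksuccessfuljobs`	Create a `_SUCCESS` marker file on successful completion (and delete any existing one in job setup)	`true`

There are some more, as covered in the Advanced topics section.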

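Scaling jobs `mapreduce.manifest.committer.io.threads`

The core reason that this committer is faster than the classic FileOutputCommitter is that it tries to parallelize as much file IO as it can during job commit, specifically:

* task manifest loading
* deletion of files where directories will be created
* directory creation
* file-by-file renaming
* deletion of task attempt directories in job cleanup

These operations are all performed in the same thread pool, whose size is set in the option mapreduce.manifest.committer.io.threads.

Larger values may be used.

Hadoop XML configuration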

<property>
  <name>mapreduce.manifest.committer.io.threads</name>
  <value>32</value>
</property>

In spark-defaults.conf

spark.hadoop.mapreduce.manifest.committer.io.threads 32

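A value larger than the number of cores allocated to the MapReduce AM or Spark Driver does not directly overload the CPUs, as the threads are normally waiting for (slow) IO against the object store/filesystem to complete.

Manifest loading in job commit may be memory intensive; the larger the number of threads, the more manifests will be loaded simultaneously.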
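
To illustrate why a shared pool of IO-bound threads helps, here is a minimal sketch of file-by-file renaming through a single thread pool. It is not the committer's implementation: it uses the local filesystem in place of an object store, and the names rename_all and IO_THREADS are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Stands in for mapreduce.manifest.committer.io.threads in this sketch.
IO_THREADS = 32


def rename_all(renames, threads=IO_THREADS):
    """Rename (source, dest) pairs in parallel; return the number renamed."""

    def rename_one(pair):
        source, dest = pair
        parent = os.path.dirname(dest)
        if parent:
            # exist_ok makes concurrent creation of a shared parent safe.
            os.makedirs(parent, exist_ok=True)
        os.rename(source, dest)
        return dest

    with ThreadPoolExecutor(max_workers=threads) as pool:
        # list() waits for every rename and re-raises the first failure.
        return len(list(pool.map(rename_one, renames)))
```

Because each thread spends most of its time waiting for the store, the pool can usefully be larger than the number of cores.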

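Caveats

* In Spark, multiple jobs may be committed in the same process, each of which will create its own thread pool during job commit or cleanup.
* Azure rate throttling may be triggered if too many IO requests are made against the store. The rate throttling option mapreduce.manifest.committer.io.rate can help avoid this.

`mapreduce.manifest.committer.writer.queue.capacity`

This is a secondary scale option. It controls the size of the queue which passes lists of files to rename from the pool of worker threads, which load manifests from the target filesystem, to the single thread which saves the entries from each manifest to an intermediate file in the local filesystem.

Once the queue is full, all manifest loading threads will block.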

<property>
  <name>mapreduce.manifest.committer.writer.queue.capacity</name>
  <value>32</value>
</property>

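As the local filesystem is usually much faster to write to than any cloud store, this queue size should not be a limit on manifest load performance.

It can help limit the amount of memory consumed while loading manifests during job commit. The maximum number of loaded manifests will be: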

mapreduce.manifest.committer.writer.queue.capacity + mapreduce.manifest.committer.io.threads
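
The bound follows from the structure of the pipeline: each loader thread can hold one manifest while the queue holds up to its capacity. The following is a simplified sketch of that pattern, not the committer's code; load_and_save and max_loaded_manifests are hypothetical names, and plain Python objects stand in for manifests.

```python
import queue
import threading


def load_and_save(manifests, queue_capacity, io_threads):
    """Load manifests on several threads; one writer thread saves them."""
    pending = queue.Queue()                       # manifests still to load
    for manifest in manifests:
        pending.put(manifest)
    loaded = queue.Queue(maxsize=queue_capacity)  # the bounded writer queue
    saved = []

    def loader():
        while True:
            try:
                manifest = pending.get_nowait()
            except queue.Empty:
                return
            loaded.put(manifest)                  # blocks while the queue is full

    def writer():
        while True:
            manifest = loaded.get()
            if manifest is None:                  # sentinel: all loaders are done
                return
            saved.append(manifest)

    writer_thread = threading.Thread(target=writer)
    writer_thread.start()
    loaders = [threading.Thread(target=loader) for _ in range(io_threads)]
    for thread in loaders:
        thread.start()
    for thread in loaders:
        thread.join()
    loaded.put(None)
    writer_thread.join()
    return saved


def max_loaded_manifests(queue_capacity, io_threads):
    """Upper bound on manifests held in memory at once."""
    return queue_capacity + io_threads
```

With the default values from the tables in this document (a queue capacity of 32 and 64 IO threads), max_loaded_manifests gives 96.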

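Optional: deleting target files in job commit

The classic FileOutputCommitter deletes files at the destination paths before renaming the job's files into place.

This is optional in the manifest committer, set in the option mapreduce.manifest.committer.delete.target.files, with a default value of false.

Leaving it disabled increases performance and is safe when all files created by a job have unique filenames.

Apache Spark generates unique filenames for ORC and Parquet since SPARK-8406, "Adding UUID to output file name to avoid accidental overwriting".

Avoiding checks for, and deletion of, target files saves one delete call per file being committed, so it can save a significant amount of store IO.

When appending to existing tables using formats other than ORC and Parquet, enable deletion of the target files unless you are confident that unique identifiers are added to each filename.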

spark.hadoop.mapreduce.manifest.committer.delete.target.files true

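Note 1: the committer skips deletion operations when it has itself created the directory into which a file is to be renamed. This makes it slightly more efficient, at least when jobs appending data are creating and writing into new partitions.

Note 2: the committer still requires tasks within a single job to create unique files. This is foundational for any job to generate correct data.

Spark Dynamic Partition overwriting

Spark has a feature called "Dynamic Partition Overwrites".

This can be initiated in SQL

INSERT OVERWRITE TABLE ...

or through DataSet writes where the mode is overwrite and the partitioning matches that of the existing table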

sparkConf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
// followed by an overwrite of a Dataset into an existing partitioned table.
eventData2
  .write
  .mode("overwrite")
  .partitionBy("year", "month")
  .format("parquet")
  .save(existingDir)

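This feature is implemented in Spark, which

1. Directs the job to write its new data to a temporary directory.
2. After job commit completes, scans the output to identify the leaf directories ("partitions") into which data was written.
3. Deletes the content of those directories in the destination table.
4. Renames the new files into the partitions.

This is all done in Spark, which takes over the tasks of scanning the intermediate output tree, deleting partitions and renaming the new files.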
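
The steps above can be sketched on a local filesystem as follows. This only illustrates the sequence of scan, delete and rename; it is not Spark's implementation, the function name overwrite_partitions is hypothetical, and files at the root of the staging directory are deliberately skipped.

```python
import os
import shutil


def overwrite_partitions(staging_dir, table_dir):
    """Replace each partition of table_dir which has new data in staging_dir.

    Returns the sorted relative paths of the partitions which were replaced.
    """
    replaced = []
    for root, _dirs, files in os.walk(staging_dir):
        if not files or root == staging_dir:
            continue                      # only leaf directories holding data
        partition = os.path.relpath(root, staging_dir)
        target = os.path.join(table_dir, partition)
        if os.path.isdir(target):
            shutil.rmtree(target)         # delete the existing partition content
        os.makedirs(target)
        for name in files:                # sequential, file-by-file rename
            os.rename(os.path.join(root, name), os.path.join(target, name))
        replaced.append(partition)
    return sorted(replaced)
```

Partitions of the destination table which received no new data are left untouched.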

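This feature also adds the ability for a job to write data entirely outside the destination table, which is done by

1. Writing new files into the working directory.
2. Spark moving them to the final destination in job commit.

The manifest committer is compatible with dynamic partition overwrites on Azure and Google cloud storage, as together they meet the core requirements of the extension:

1. The working directory returned in getWorkPath() is in the same filesystem as the final output.
2. rename() is an O(1) operation which is safe and fast to use when committing a job.

None of the S3A committers support this. Condition (1) is not met by the staging committers, while (2) is not met by S3 itself.

To use the manifest committer with dynamic partition overwrites, the Spark version must contain SPARK-40034, "PathOutputCommitters to work with dynamic partition overwrite".

Be aware that the rename phase of the operation will be slow if many files are renamed, because this is done sequentially. Parallel renaming would speed this up, but could trigger the ABFS overload problems which the manifest committer is designed both to minimize the risk of and to support recovery from.

The Spark side of the commit operation lists/treewalks the temporary output directory (some overhead), then promotes the files with classic filesystem rename() calls. There is no explicit rate limiting here.

What does this mean?

It means that dynamic partitioning should not be used on Azure Storage for SQL queries/Spark DataSet operations where many thousands of files are created. The fact that these will suffer from performance problems before throttling scale issues surface should be considered a warning.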

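Job summaries in `_SUCCESS` files

The original Hadoop committer creates a zero byte _SUCCESS file in the root of the output directory unless disabled.

This committer writes a JSON summary which includes

* The name of the committer.
* Diagnostics information.
* A list of some of the files created (for testing; a full list is excluded as it can get big).
* IO statistics.

If, after running a query, this _SUCCESS file is zero bytes long, the new committer has not been used.

If it is not empty, then it can be examined.

Viewing `_SUCCESS` files through the `ManifestPrinter` tool

The summary files are JSON, and can be viewed in any text editor.

For a more succinct summary, including better display of statistics, use the ManifestPrinter tool.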

hadoop org.apache.hadoop.mapreduce.lib.output.committer.manifest.files.ManifestPrinter <path>

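This works for the files saved at the base of an output directory, and for any reports saved to a report directory.

Collecting job summaries `mapreduce.manifest.committer.summary.report.directory`

The committer can be configured to save the _SUCCESS summary files to a report directory, irrespective of whether the job succeeded or failed, by setting a filesystem path in the option mapreduce.manifest.committer.summary.report.directory.

The path does not have to be on the same store/filesystem as the destination of work. For example, a local filesystem could be used.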

XML

<property>
  <name>mapreduce.manifest.committer.summary.report.directory</name>
  <value>file:///tmp/reports</value>
</property>

spark-defaults.conf

spark.hadoop.mapreduce.manifest.committer.summary.report.directory file:///tmp/reports

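This allows the statistics of jobs to be collected irrespective of their outcome, whether or not saving the _SUCCESS marker is enabled, and without problems caused by a chain of queries overwriting the markers.

Cleanup

Job cleanup is convoluted, as it is designed to address a number of issues which may surface in cloud storage.

* Slow performance for deletion of directories.
* Timeouts when deleting very deep and wide directory trees.
* General resilience, so that cleanup problems do not escalate into job failures.

Option	Meaning	Default Value
`mapreduce.fileoutputcommitter.cleanup.skipped`	Skip cleanup of the `_temporary` directory	`false`
`mapreduce.fileoutputcommitter.cleanup-failures.ignored`	Ignore errors during cleanup	`false`
`mapreduce.manifest.committer.cleanup.parallel.delete`	Delete task attempt directories in parallel	`true`

The algorithm is: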

if `mapreduce.fileoutputcommitter.cleanup.skipped`:
  return
if `mapreduce.manifest.committer.cleanup.parallel.delete`:
  attempt parallel delete of task directories; catch any exception
if not `mapreduce.fileoutputcommitter.cleanup.skipped`:
  delete(`_temporary`); catch any exception
if caught-exception and not `mapreduce.fileoutputcommitter.cleanup-failures.ignored`:
  throw caught-exception

It's a bit complicated, but the goal is to perform a fast/scalable delete and to throw a meaningful exception if that did not work.
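
As a runnable illustration of the algorithm above, here is a minimal sketch against the local filesystem. It is not the committer's code: the function and parameter names are hypothetical, the three boolean parameters stand in for the three options in the table, and only OSError is treated as a cleanup failure.

```python
import os
import shutil
from concurrent.futures import ThreadPoolExecutor


def cleanup(temp_dir, task_attempt_dirs, skipped=False,
            parallel_delete=True, ignore_failures=False, threads=8):
    """Delete task attempt directories, then the _temporary directory."""
    if skipped:
        return
    caught = None
    if parallel_delete:
        try:
            with ThreadPoolExecutor(max_workers=threads) as pool:
                # list() waits for all deletes and re-raises the first failure.
                list(pool.map(shutil.rmtree, task_attempt_dirs))
        except OSError as e:
            caught = e
    try:
        # Always attempt the final delete, even if the parallel phase failed.
        if os.path.isdir(temp_dir):
            shutil.rmtree(temp_dir)
    except OSError as e:
        caught = caught or e
    if caught is not None and not ignore_failures:
        raise caught
```

A failure in the parallel phase does not prevent the final delete from being attempted; the first exception caught is only rethrown at the end, and only when failures are not ignored.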

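When working with ABFS and GCS, these settings should normally be left alone. If errors do surface during cleanup, enabling the option to ignore failures will ensure the job still completes. Disabling cleanup avoids the overhead of cleanup altogether, but requires a workflow or manual operation to clean up all _temporary directories on a regular basis.

Working with Azure ADLS Gen2 Storage

To switch to the manifest committer, the factory for committers for destinations with abfs:// URLs must be switched to the manifest committer factory, either for the application or for the entire cluster.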

<property>
  <name>mapreduce.outputcommitter.factory.scheme.abfs</name>
  <value>org.apache.hadoop.fs.azurebfs.commit.AzureManifestCommitterFactory</value>
</property>

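This allows ADLS Gen2-specific performance and consistency logic to be used from within the committer. In particular:

* The Etag header can be collected in listings and used in the job commit phase.
* IO rename operations are rate limited.
* Recovery is attempted when throttling triggers rename failures.

Warning: this committer is not compatible with older Azure storage services (WASB or ADLS Gen 1).

The core set of Azure-optimized options becomes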

<property>
  <name>mapreduce.outputcommitter.factory.scheme.abfs</name>
  <value>org.apache.hadoop.fs.azurebfs.commit.AzureManifestCommitterFactory</value>
</property>

<property>
  <name>fs.azure.io.rate.limit</name>
  <value>10000</value>
</property>

And optional settings for debugging/performance analysis

<property>
  <name>mapreduce.manifest.committer.summary.report.directory</name>
  <value>abfs:// Path within same store/separate store</value>
  <description>Optional: path to where job summaries are saved</description>
</property>

Full set of ABFS options for Spark

spark.hadoop.mapreduce.outputcommitter.factory.scheme.abfs org.apache.hadoop.fs.azurebfs.commit.AzureManifestCommitterFactory
spark.hadoop.fs.azure.io.rate.limit 10000
spark.sql.parquet.output.committer.class org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
spark.sql.sources.commitProtocolClass org.apache.spark.internal.io.cloud.PathOutputCommitProtocol

spark.hadoop.mapreduce.manifest.committer.summary.report.directory  (optional: URI of a directory for job summaries)

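Experimental: ABFS rename rate limiting `fs.azure.io.rate.limit`

To avoid triggering store throttling and backoff delays, as well as other throttling-related failure conditions, file renames during job commit are throttled through a "rate limiter" which limits the number of rename operations per second that a single instance of the ABFS FileSystem client may issue.

Option	Meaning
`fs.azure.io.rate.limit`	Rate limit in operations/second for IO operations.

Set the option to 0 to remove all rate limiting.

The default value is 10000, which is the default IO capacity for an ADLS storage account.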

<property>
  <name>fs.azure.io.rate.limit</name>
  <value>10000</value>
  <description>maximum number of renames attempted per second</description>
</property>

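This capacity is set at the level of the filesystem client, and so is not shared across all processes within a single application, let alone other applications sharing the same storage account.

It is shared by all jobs being committed by the same Spark driver, as these do share that filesystem connector.

If rate limiting is imposed, the statistic store_io_rate_limited will report the time spent acquiring permits for committing files.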
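
To make the idea of acquiring permits concrete, here is a minimal, single-threaded sketch of a rate limiter which spaces operations evenly and records the time spent waiting. It is an illustration only and not the ABFS client's implementation; the class name, its attributes and the injectable clock and sleep parameters are all hypothetical.

```python
import time


class RenameRateLimiter:
    """Allow at most ops_per_second operations; 0 disables limiting."""

    def __init__(self, ops_per_second, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / ops_per_second if ops_per_second > 0 else 0.0
        self.clock = clock
        self.sleep = sleep
        self.next_free = clock()   # earliest time the next permit is available
        self.total_wait = 0.0      # analogous to a "time rate limited" statistic

    def acquire(self):
        """Block until one operation is permitted; return the seconds waited."""
        if self.interval == 0.0:
            return 0.0
        now = self.clock()
        wait = max(0.0, self.next_free - now)
        if wait > 0:
            self.sleep(wait)
        self.next_free = max(now, self.next_free) + self.interval
        self.total_wait += wait
        return wait
```

A caller would invoke acquire() before each rename. Because the limiter lives inside one client instance, separate processes each get their own budget, which matches the sharing behavior described above.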

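If server-side throttling took place, signs of this can be seen in

* The store service's logs and their throttling status codes (usually 503 or 500).
* The job statistic commit_file_rename_recovered. This statistic indicates that ADLS throttling manifested as failures in renames, failures which were recovered from in the committer.

If these are seen, or if other applications running at the same time experience throttling/throttling-triggered problems, consider reducing the value of fs.azure.io.rate.limit, and/or requesting a higher IO capacity from Microsoft.

Important: if you do get extra capacity from Microsoft and you want to use it to speed up job commits, increase the value of fs.azure.io.rate.limit either across the cluster, or specifically for those jobs to which you wish to allocate extra priority.

This is still a work in progress; it may be expanded to support all IO operations performed by a single filesystem instance.

Working with Google Cloud Storage

The manifest committer is compatible with and tested against Google cloud storage through the gcs-connector library from Google, which provides a Hadoop filesystem client for the schema gs.

Google cloud storage has the semantics needed for the commit protocol to work safely.

The Spark settings to switch to this committer are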

spark.hadoop.mapreduce.outputcommitter.factory.scheme.gs org.apache.hadoop.mapreduce.lib.output.committer.manifest.ManifestCommitterFactory
spark.sql.parquet.output.committer.class org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
spark.sql.sources.commitProtocolClass org.apache.spark.internal.io.cloud.PathOutputCommitProtocol

spark.hadoop.mapreduce.manifest.committer.summary.report.directory  (optional: URI of a directory for job summaries)

The store's directory delete operations are O(files), so the value of mapreduce.manifest.committer.cleanup.parallel.delete should be left at the default of true.

For MapReduce, declare the binding in core-site.xml or mapred-site.xml.

<property>
  <name>mapreduce.outputcommitter.factory.scheme.gs</name>
  <value>org.apache.hadoop.mapreduce.lib.output.committer.manifest.ManifestCommitterFactory</value>
</property>

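Working with HDFS

This committer does work with HDFS; it has simply been targeted at object stores with reduced performance on some operations, especially listing and renaming, and at stores whose semantics are too limited for the classic FileOutputCommitter to rely on (specifically GCS).

To use it on HDFS, set the ManifestCommitterFactory as the committer factory for hdfs:// URLs.

Because HDFS does fast directory deletion, there is no need to parallelize deletion of task attempt directories during cleanup, so set mapreduce.manifest.committer.cleanup.parallel.delete to false.

The final Spark bindings become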

spark.hadoop.mapreduce.outputcommitter.factory.scheme.hdfs org.apache.hadoop.mapreduce.lib.output.committer.manifest.ManifestCommitterFactory
spark.hadoop.mapreduce.manifest.committer.cleanup.parallel.delete false
spark.sql.parquet.output.committer.class org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
spark.sql.sources.commitProtocolClass org.apache.spark.internal.io.cloud.PathOutputCommitProtocol

spark.hadoop.mapreduce.manifest.committer.summary.report.directory  (optional: URI of a directory for job summaries)

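Advanced topics

Advanced configuration options

There are some advanced options which are intended for development and testing, rather than production use.

Option	Meaning	Default Value
`mapreduce.manifest.committer.store.operations.classname`	Classname for Manifest Store Operations	`""`
`mapreduce.manifest.committer.validate.output`	Perform output validation?	`false`
`mapreduce.manifest.committer.writer.queue.capacity`	Queue capacity for writing the intermediate file	`32`

Validating output `mapreduce.manifest.committer.validate.output`

The option mapreduce.manifest.committer.validate.output triggers a check of every renamed file to verify that it has the expected length.

This adds the overhead of a HEAD request per file, and so is recommended for testing only.

There is no verification of the actual contents.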
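
A length-only check of this kind can be sketched as follows. This is a local-filesystem illustration under assumed names (validate_lengths and its dictionary argument are hypothetical); the committer itself issues its probes against the object store.

```python
import os


def validate_lengths(expected):
    """Compare file sizes against expected lengths.

    expected maps each destination path to the length recorded for it.
    Returns a list of (path, expected_length, actual_length) mismatches,
    where actual_length is None if the file is missing.
    """
    failures = []
    for path, length in expected.items():
        actual = os.path.getsize(path) if os.path.isfile(path) else None
        if actual != length:
            failures.append((path, length, actual))
    return failures
```

As in the committer, a file with the right length but corrupt contents would pass this check.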

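Controlling storage integration `mapreduce.manifest.committer.store.operations.classname`

The manifest committer interacts with filesystems through implementations of the interface ManifestStoreOperations. It is possible to provide custom implementations for store-specific features. There is one of these for ABFS; when the abfs-specific committer factory is used, it is set automatically.

It can be explicitly set.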

<property>
  <name>mapreduce.manifest.committer.store.operations.classname</name>
  <value>org.apache.hadoop.fs.azurebfs.commit.AbfsManifestStoreOperations</value>
</property>

The default implementation may also be configured.

<property>
  <name>mapreduce.manifest.committer.store.operations.classname</name>
  <value>org.apache.hadoop.mapreduce.lib.output.committer.manifest.impl.ManifestStoreOperationsThroughFileSystem</value>
</property>

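There is no need to alter these values, except when writing new implementations for other stores, something which is only needed if the store provides extra integration support for the committer.

Support for concurrent jobs to the same directory

It may be possible to run multiple jobs targeting the same directory tree.

For this to work, a number of conditions must be met:

* When using Spark, unique job IDs must be set. This means the Spark distribution must contain the patches for SPARK-33402 and SPARK-33230.
* Cleanup of the _temporary directory must be disabled by setting mapreduce.fileoutputcommitter.cleanup.skipped to true.
* All jobs/tasks must create files with unique filenames.
* All jobs must create output with the same directory partition structure.
* The jobs/queries must not use Spark Dynamic Partitioning "INSERT OVERWRITE TABLE"; data may be lost. This holds for all committers, not just the manifest committer.
* Remember to delete the _temporary directory later!

This has not been tested.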
