What is PARQUET_READ_PARALLELISM?

When I run my jobs I see: parquet.hadoop.ParquetFileReader: Initiating action with parallelism: 5

It is set to 5 by default, but what is it, and how can I use it to get better performance?

Best answer

Yes, it defaults to 5.

The configuration parameter's name is parquet.metadata.read.parallelism. It controls only how many threads are used to read the metadata of Parquet files.

I believe it does not affect performance much, since it concerns only the reading of metadata, not the data itself.
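To make the idea concrete, here is a minimal sketch of what "reading metadata with a given parallelism" means. It uses only the Java standard library and is not Parquet's actual implementation: the class FooterReadSketch, the helper readFooter, and the file names are all hypothetical stand-ins chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class FooterReadSketch {

    // Hypothetical stand-in for reading one file's metadata (no row data is touched).
    static String readFooter(String path) {
        return "footer-of:" + path;
    }

    // Reads the metadata of every path using at most `parallelism` threads.
    // Results are returned in the same order as the input paths.
    static List<String> readFooters(List<String> paths, int parallelism) {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String path : paths) {
                Callable<String> task = () -> readFooter(path);
                pending.add(pool.submit(task));
            }
            List<String> footers = new ArrayList<>();
            for (Future<String> f : pending) {
                footers.add(f.get()); // blocks until that file's metadata is available
            }
            return footers;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException("metadata read failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList("part-0.parquet", "part-1.parquet", "part-2.parquet");
        // 5 mirrors the default value reported in the log line from the question.
        System.out.println(readFooters(paths, 5));
    }
}
```

In a real job the value would be set on the Hadoop Configuration instead, for example with conf.setInt("parquet.metadata.read.parallelism", 10). Raising it is only likely to matter when a job opens a very large number of files, because the pool is used for metadata reads and not for scanning the data.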
