hadoop - How can I use data in memory as an input format?



I'm writing a MapReduce job, and I have the input that I want to pass to the mappers in memory.

The usual method to pass input to the mappers is via HDFS, using SequenceFileInputFormat or TextInputFormat. These input formats need to have files in HDFS, which are loaded and split across the mappers.

I can't find a simple method to pass, let's say, a List of elements to the mappers. I find myself having to write these elements to disk and then use FileInputFormat.

Any solution?

I'm writing the code in Java, of course.
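For reference, the workaround described above (materializing the in-memory list so a file-based input format can read it) can be sketched with plain Java I/O. The class and method names here are just illustrative assumptions; writing one element per line is what lets TextInputFormat later produce one map input record per element:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class SpillToDisk {
    // Write one element per line so a line-oriented input format
    // (e.g. TextInputFormat) can turn each line into a record.
    public static Path spill(List<String> elements) throws IOException {
        Path tmp = Files.createTempFile("mapper-input-", ".txt");
        Files.write(tmp, elements);
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        Path p = spill(List.of("a", "b", "c"));
        System.out.println(Files.readAllLines(p)); // [a, b, c]
    }
}
```

In a real job the resulting path would then be copied into HDFS and registered with `FileInputFormat.addInputPath`, which is exactly the round trip through disk the question is trying to avoid.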




An InputFormat does not have to load data from disk or a file system. There are also input formats that read data from other systems, like HBase's TableInputFormat (http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapred/TableInputFormat.html), where the data is not assumed to sit on disk, only to be available via some API on all nodes of the cluster. So you need to implement an InputFormat that splits the data using your own logic (since there are no files, that part is your own task) and chops the data into records. Please note that your in-memory data source should be distributed and running on all nodes of the cluster. You will also need some efficient IPC mechanism to pass data from your process to the mapper process. I would also be glad to know what case of yours leads to this unusual requirement.
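To make the "split the data in your own logic, then chop it into records" step concrete, here is a minimal sketch of that partitioning in plain Java, deliberately kept independent of the Hadoop API (the class and method names are hypothetical). A custom InputFormat's `getSplits()` would perform the first step, and its RecordReader would iterate one resulting chunk, emitting one record per element:

```java
import java.util.ArrayList;
import java.util.List;

public class InMemorySplitter {
    // Stand-in for getSplits(): divide the in-memory data into
    // numSplits contiguous chunks, one per mapper.
    public static <T> List<List<T>> split(List<T> data, int numSplits) {
        List<List<T>> splits = new ArrayList<>();
        int chunk = (data.size() + numSplits - 1) / numSplits; // ceiling division
        for (int start = 0; start < data.size(); start += chunk) {
            // Each sub-list is what one RecordReader instance would
            // walk through, handing elements to its mapper as records.
            splits.add(data.subList(start, Math.min(start + chunk, data.size())));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<List<Integer>> splits = split(List.of(1, 2, 3, 4, 5), 2);
        System.out.println(splits); // [[1, 2, 3], [4, 5]]
    }
}
```

The remaining (and harder) part, as noted above, is ensuring every node can actually reach the data behind each split, e.g. via a shared service or an IPC channel, since the splits themselves only describe which slice of the data a mapper owns.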