Finding elements (neighbors) in a representation of a sparse graph (matrix)

I have a large sparse graph that I am representing as an adjacency matrix (100k by 100k or bigger), stored as an array of edges. An example with a (non-sparse) 2 by 2 matrix:

0 7
4 0

example_array = [ [7,1,2], [4,2,1] ]

E.g. [7,1,2] says that there is a directed edge from node 1 to node 2 with the value/weight 7. In matrix lingo, this is essentially [ value, row, column ].

Also, this "array of edges" will be sorted by the first element (the weight). In the example above, after sorting, the array becomes:

example_array = [ [4,2,1], [7,1,2] ]

Problem

For a given value i, I need to find all elements in this sorted "array of edges" whose second value equals i, i.e. find every j such that example_array[j][1] == i.

My preliminary implementation simply iterates over all elements in the array, comparing the second value of each element with i. This is computationally expensive because there might still be a lot of elements (e.g. 500k) to loop through.
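For reference, the linear scan described above might look roughly like the sketch below. The function name find_out_edges and the choice of int for weights and node ids are assumptions made for illustration only; the original post does not show its code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical baseline: scan every [value, row, column] triple and collect
 * the indices j for which edges[j][1] == i. The caller supplies `out` with
 * room for n indices. Returns the number of matches. Cost: O(n) per query. */
size_t find_out_edges(const int edges[][3], size_t n, int i, size_t *out)
{
    assert(edges != NULL && out != NULL);
    size_t count = 0;
    for (size_t j = 0; j < n; ++j) {
        if (edges[j][1] == i) {
            out[count++] = j;   /* edge j starts at node i */
        }
    }
    return count;
}
```

Every query touches all n edges, which is the cost I would like to avoid.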

Question

Is there a more efficient way to do this? I do not mind using a different representation of the matrix/graph. I am writing this in C.

Additional Information

This is essentially finding all the neighbors of a node i and their edge weights, i.e. finding all directed edges from i to another node in the edge list.

Best answer

You should probably use sparse Compressed Row Storage (CRS) for that purpose. Briefly, you store the matrix row by row, so you don't need to keep both (row, column) indices for every entry. Instead you keep a row pointer, i.e. an array which tells you where a given row starts in memory. Then you keep the column index array (col_ind), which tells you the column of each non-zero entry in that row, and you store the corresponding values (val). This cuts down the storage requirements and also speeds up matrix searches, since the col_ind entries within every row are sorted. So you have direct access to every matrix row, and you can quickly locate entries within a row using bisection, or any other sorted-list search of your choice.
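As a rough illustration of the layout described above, here is a minimal sketch in C. The struct crs_matrix and the functions crs_row and crs_find are hypothetical names, indices are assumed to be 0-based, and weights are assumed to be doubles; only row_ptr, col_ind and val follow the naming in the text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CRS container. Row r occupies the half-open index range
 * [row_ptr[r], row_ptr[r + 1]) in col_ind and val. */
typedef struct {
    size_t n_rows;
    const size_t *row_ptr;  /* n_rows + 1 entries */
    const size_t *col_ind;  /* column of each stored entry, ascending per row */
    const double *val;      /* weight of each stored entry */
} crs_matrix;

/* All out-edges of node `row`: sets *cols and *vals to the start of the
 * row's slice and returns its length. No search is needed. */
size_t crs_row(const crs_matrix *m, size_t row,
               const size_t **cols, const double **vals)
{
    assert(row < m->n_rows);
    size_t begin = m->row_ptr[row];
    *cols = m->col_ind + begin;
    *vals = m->val + begin;
    return m->row_ptr[row + 1] - begin;
}

/* Bisection inside one row: returns a pointer to the weight stored at
 * (row, col), or NULL if that entry is not stored (i.e. it is zero). */
const double *crs_find(const crs_matrix *m, size_t row, size_t col)
{
    assert(row < m->n_rows);
    size_t lo = m->row_ptr[row];
    size_t end = m->row_ptr[row + 1];
    size_t hi = end;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (m->col_ind[mid] < col) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return (lo < end && m->col_ind[lo] == col) ? &m->val[lo] : NULL;
}
```

With this layout, listing the neighbors of node i is just reading the slice returned by crs_row, so the cost is proportional to the number of edges leaving i rather than to the total number of edges.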

A CRS matrix can be created quickly using insertion into sorted lists, or e.g. bucket sort if you have explicitly constructed the (i,j) coordinates of every matrix entry. In MATLAB you can do this using the 'sparse' function. If you don't want to code it yourself and need a library, have a look at SuiteSparse by Tim Davis.
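One possible way to do the bucket-sort construction is a two-pass counting sort over the row index, sketched below. The type edge and the function crs_from_edges are hypothetical names, indices are assumed to be 0-based, and the caller is assumed to provide the output arrays.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical edge record mirroring [value, row, column], 0-based. */
typedef struct { double value; size_t row, col; } edge;

/* Builds CRS arrays from an edge list by bucketing edges on their row.
 * row_ptr must hold n_rows + 1 entries; col_ind and val n_edges each.
 * Returns 0 on success, -1 if the scratch allocation fails. */
int crs_from_edges(const edge *edges, size_t n_edges, size_t n_rows,
                   size_t *row_ptr, size_t *col_ind, double *val)
{
    /* Pass 1: count the entries of row r in row_ptr[r + 1]. */
    for (size_t r = 0; r <= n_rows; ++r) row_ptr[r] = 0;
    for (size_t e = 0; e < n_edges; ++e) {
        assert(edges[e].row < n_rows);
        row_ptr[edges[e].row + 1]++;
    }
    /* Prefix sum turns the counts into row start offsets. */
    for (size_t r = 0; r < n_rows; ++r) row_ptr[r + 1] += row_ptr[r];

    /* Pass 2: scatter each edge into the next free slot of its row. */
    size_t *next = malloc((n_rows ? n_rows : 1) * sizeof *next);
    if (next == NULL) return -1;
    for (size_t r = 0; r < n_rows; ++r) next[r] = row_ptr[r];
    for (size_t e = 0; e < n_edges; ++e) {
        size_t slot = next[edges[e].row]++;
        col_ind[slot] = edges[e].col;
        val[slot] = edges[e].value;
    }
    free(next);
    return 0;
}
```

The scatter pass is stable, so entries within a row keep their input order. If the input edge list is sorted by weight, each row therefore ends up sorted by weight; to bisect on column indices, each row's slice would need an additional sort by column.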

For a brief description of the CRS format, see e.g. here, but there are thousands of other sources for it.

Edit: You can do what you need easily with a modified CRS storage. First, you need to create the matrix by sorting the entries within each row by value, instead of by column index as is usually done. This means that the smallest value of each row is stored as the first entry of that row. Then, to find the globally smallest value, you search the first entry of all rows (O(n) complexity, where n is the number of rows). Knowing that, you obtain the corresponding column index in constant time by reading the first column index in the row containing the smallest value. You can do all that because, thanks to your row pointer, you know where the rows start in memory.
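The global-minimum search over a value-sorted CRS layout could be sketched as follows. The function name crs_global_min is hypothetical, indices are assumed to be 0-based, and each row's entries are assumed to be already sorted by ascending value.

```c
#include <assert.h>
#include <stddef.h>

/* Finds the smallest stored value in a CRS matrix whose rows are sorted by
 * value, by inspecting only the first entry of every non-empty row.
 * Returns 1 and fills *row, *col, *min if any entry exists, 0 otherwise. */
int crs_global_min(const size_t *row_ptr, const size_t *col_ind,
                   const double *val, size_t n_rows,
                   size_t *row, size_t *col, double *min)
{
    assert(row != NULL && col != NULL && min != NULL);
    int found = 0;
    for (size_t r = 0; r < n_rows; ++r) {
        size_t first = row_ptr[r];
        if (first == row_ptr[r + 1]) continue;   /* empty row */
        if (!found || val[first] < *min) {
            *min = val[first];
            *row = r;
            *col = col_ind[first];   /* column read in constant time */
            found = 1;
        }
    }
    return found;
}
```

The loop visits each row once and reads a single entry per row, which matches the O(n) bound in the number of rows. The trade-off is that rows sorted by value can no longer be bisected by column index.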

You can have a look at this code. It is a set of MEX files for MATLAB, implemented in C. The file you are interested in is sparse_create_mex.c. It creates the sparse matrix structure by iteratively adding (i, j, value) triples into sorted lists. You would need to modify the sorted lists a bit: right now they are implemented for integer column indices and double values. Since the sorted lists are implemented as macro templates, you only need to declare a new sorted list type (see sorted_list.h and sorted_list_templates.h).
