Friday, June 12, 2015

Large Sparse Matrix-Vector dot product in Python with Key-Value Storages

The SciPy and NumPy modules are pretty handy Python tools for numerical work on a stand-alone machine. For large matrices, many tools for distributed/parallel environments exist, but stand-alone tools are still convenient in many situations.

I created a random sparse matrix of the same size as the LiveJournal graph in the SNAP data set. Since that network contains 4847571 vertices and 68993773 edges, I created a 4847571 x 4847571 matrix with 68993773 non-zero elements. I first considered the scipy.sparse module.

from scipy.sparse import dok_matrix
import numpy as np

def matrixGen():
    # numpy's randint excludes `high`, so 4847571 covers indices 0..4847570
    subscripts = np.random.randint(0, high=4847571, size=(68993773, 2))
    matrix = dok_matrix((4847571, 4847571), dtype=np.int)
    for i in range(0, 68993773):
        matrix[subscripts[i, 0], subscripts[i, 1]] = 1
        if i % 10000 == 0:
            print i   # progress report
    return matrix
However, I hit a wall. In my IPython shell, the function above did not finish on my MacBook Air (8GB RAM):

In [9]: %time matrix = matrixGen()

It stalled for more than 30 minutes after the screen printed 35700000, so I stopped it; I could not even construct the desired matrix on my system. So I searched for alternative ways: store the matrix in a key-value storage as key-value pairs of ("i,j", 1), where "i,j" is the index and 1 is the non-zero value. I first tried LMDB (http://symas.com/mdb/), which is known as one of the fastest key-value storages that can be embedded into an application.
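
One subtlety before the code: LMDB orders keys lexicographically, so "10,0" sorts before "2,0" unless the indices are zero-padded. The code below therefore pads every index to a fixed width; the helper here is hypothetical and only illustrates the key scheme (7 digits are enough for indices up to 4847570):

def makeKey(name, i, j):
    # zero-padding keeps lexicographic key order consistent with numeric order
    return "%s:%07d,%07d" % (name, i, j)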

import os
import json
import lmdb
import numpy as np
from scipy.sparse import dok_matrix

class Graph:
    def __init__(self, name, **kwargs):
        self.name = name
        self.vname = "v"
        self.env = lmdb.open(os.path.join(os.getcwd(), 'db'), map_size=5000000000,
                             max_dbs=2, sync=False, writemap=True, map_async=True)
        self.db = self.env.open_db(self.name)
        self.vdb = self.env.open_db(self.vname)

    def Open(self):
        with self.env.begin(db=self.db) as txn:
            results = json.loads(txn.get(self.name))
            self.vertices = results['v']
            self.edges = results['e']

    def Generator(self, vertices, edges):
        "Generator(self, vertices, edges) generates a graph in a sparse matrix form"
        self.vertices = vertices
        self.edges = edges
        # numpy's randint excludes `high`, so self.vertices covers all valid indices
        subscripts = np.random.randint(0, high=self.vertices, size=(self.edges, 2))
        txn = self.env.begin(db=self.db, write=True)
        for i in range(0, self.edges):
            if i % 10000 == 0 and i > 0:
                # commit the finished batch and open a fresh transaction;
                # committing inside a `with env.begin()` block would invalidate it
                txn.commit()
                txn = self.env.begin(db=self.db, write=True)
                print 100.0 * i / self.edges, "%"
            # zero-padded indices keep LMDB's lexicographic key order numeric
            txn.put(self.name + ":" + "%07d,%07d" % (subscripts[i, 0], subscripts[i, 1]), "1")
        txn.put(self.name, json.dumps({"v": self.vertices, "e": self.edges}))
        txn.commit()
        self.env.sync()

This time, I had no trouble generating the matrix. It took around 10~15 minutes on my MacBook Air (8GB RAM); I didn't record the run, so I can't recall the exact time.
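
For reference, a minimal sketch of how the class above can be driven; the database name "lj" is my own placeholder:

g = Graph("lj")
g.Generator(4847571, 68993773)   # build and store the random graph
g.Open()                         # later: reload the vertex/edge counts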

After that, I tried dotProduct.

    def dotProduct(self, vector, batchSize=2000):
        with self.env.begin(db=self.db) as txn:
            with self.env.begin(db=self.vdb, write=True) as vtxn:
                cursor = txn.cursor(db=self.db)
                vcursor = vtxn.cursor(db=self.vdb)
                cursor.set_range(self.name + ":" + "%07d,%07d" % (0, 0))
                rowID = -1
                count = 0
                rowMatrix = dok_matrix((batchSize, self.vertices), dtype=np.int)
                for k, v in cursor:
                    subscripts = k.split(":")[1].split(",")
                    if rowID != int(subscripts[0]):  # entered into a new row
                        rowID = int(subscripts[0])
                        count += 1
                        if rowID % batchSize == 0 and rowID > 0:
                            # the finished batch covers rows rowID-batchSize .. rowID-1
                            product = rowMatrix.tocsr().dot(vector)
                            for i in range(0, len(product)):
                                vcursor.put(self.vname + ":" + str(rowID - batchSize + i),
                                            str(product[i]))
                            rowMatrix = dok_matrix((batchSize, self.vertices), dtype=np.int)
                            print 100.0 * float(count) / self.vertices, "%"
                    colID = int(subscripts[1])
                    rowMatrix[rowID % batchSize, colID] = int(v)
                # flush the last, possibly partial, batch
                product = rowMatrix.tocsr().dot(vector)
                base = rowID - (rowID % batchSize)
                for i in range(0, rowID % batchSize + 1):
                    vcursor.put(self.vname + ":" + str(base + i), str(product[i]))
                print 100.0 * float(count) / self.vertices, "%"

I expected the dot product to be more efficient done in batches rather than element by element, so I accumulate rows into a dok_matrix, which can be constructed incrementally. I then tried three cases: (1) the dot product directly on the dok_matrix; (2) the dot product after converting the dok_matrix into a csr_matrix; and (3) as in (2), but with the result vector stored in LMDB rather than kept within Python, as it is in the first two cases.
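
For some intuition on cases (1) and (2), here is a small self-contained timing sketch. The sizes are scaled down and purely illustrative; this is not the original benchmark:

import time
import numpy as np
from scipy.sparse import dok_matrix

n, nnz = 100000, 500000
subs = np.random.randint(0, high=n, size=(nnz, 2))
m = dok_matrix((n, n), dtype=np.int)
for i in range(0, nnz):
    m[subs[i, 0], subs[i, 1]] = 1
v = np.random.rand(n)

start = time.time()
m.dot(v)                    # (1) dot product directly on the dok_matrix
print "dok:", time.time() - start

start = time.time()
m.tocsr().dot(v)            # (2) convert to csr_matrix first, then multiply
print "csr (incl. conversion):", time.time() - start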

It turned out not to be dismal: I could finish the matrix-vector dot product, where the sparse matrix contains 68,993,773 non-zero terms and the dense vector has dimension 4,847,571. We observe that csr_matrix (red) is more efficient than dok_matrix (blue), and that putting the result vector on LMDB is more efficient as well.

When the result dense vector is stored in an np.array, the computation gets more efficient as the block size increases. However, when the result vector is stored on LMDB, we observe a flat or negative performance impact from larger block sizes. I interpret this as the cost of overly large bulk insertions into LMDB (LMDB is B-tree based, which is generally read-optimized rather than write-optimized). That said, with block sizes of 1000 and 2000, the execution times were not very different.
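
One could probe this write behavior directly. A rough micro-benchmark sketch, with scaled-down sizes and an illustrative database path, timing a single bulk insertion at several batch sizes:

import time
import lmdb

env = lmdb.open('bench-db', map_size=10 ** 9)
for batchSize in (1000, 2000, 10000, 50000):
    start = time.time()
    with env.begin(write=True) as txn:   # one transaction per bulk insertion
        for i in range(0, batchSize):
            txn.put("k:%07d" % i, "1")
    print batchSize, "puts:", time.time() - start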

After this, I thought LevelDB might be another choice. LevelDB, from Google, is more write-optimized, as it uses a log-structured merge (LSM) tree. The relevant Python code is as follows:


import json
import leveldb
import numpy as np
from scipy.sparse import dok_matrix

class LevelGraph:   # illustrative class name; holds the LevelDB handle
    def __init__(self, path='ldb'):
        # minimal constructor; `path` is an illustrative database location
        self.db = leveldb.LevelDB(path)

    def Open(self, name):
        self.name = name
        graphInfo = json.loads(self.db.Get(name))
        self.vertices = graphInfo['vertices']
        self.edges = graphInfo['edges']
        self.cursor = 0

    def Take(self, rows):
        results = {}
        end = min(self.cursor + rows, self.vertices)
        for row in range(self.cursor, end):
            # zero-padded keys keep the range scan in numeric column order
            result = self.db.RangeIter(key_from=self.name + ":" + "%07d,%07d" % (row, 0),
                                       key_to=self.name + ":" + "%07d,%07d" % (row, self.vertices - 1))
            results[row] = [(k.split(":")[1].split(","), v) for k, v in result]
        self.cursor = end
        return results

    def dotProduct(self, vector, batchSize=2000):
        self.cursor = 0
        batch = leveldb.WriteBatch()
        for row in range(0, self.vertices, batchSize):
            results = self.Take(batchSize)
            matrix = dok_matrix((batchSize, self.vertices), dtype=np.int)
            for key in sorted(results.keys()):
                for item in sorted(results[key]):
                    matrix[int(item[0][0]) % batchSize, int(item[0][1])] = int(item[1])
            product = matrix.tocsr().dot(vector)
            # `row` is already the absolute starting row of this batch;
            # clamp so a partial last batch does not emit rows past the matrix
            for pos in range(0, min(batchSize, self.vertices - row)):
                batch.Put("result" + ":" + str(row + pos), str(product[pos]))
        self.db.Write(batch)
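
A minimal usage sketch, assuming a graph named "lj" (with its metadata record) has already been stored under the same key scheme; the class and path names are placeholders:

lg = LevelGraph('ldb')
lg.Open("lj")
vec = np.random.rand(lg.vertices)
lg.dotProduct(vec, batchSize=2000)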

Since I had already found the benefit of converting dok_matrix to csr_matrix before taking the dot product, I only tested the csr_matrix case with LevelDB. The trends are shown in the following figure.


It is similar to the LMDB cases. Note that the best time with LevelDB was 13 minutes 51 seconds, while the best time with LMDB was 12 minutes 27 seconds.