I came across this problem recently while working on a machine learning project. It looks simple at first glance, but it takes some work to get right.
Hopefully somebody who needs to do something similar will find this guide useful.
I had a large text file with several thousand feature vectors. Each row (vector) had the format:
Label Feature1:Value1 Feature2:Value2 ... FeatureN:ValueN
I also had some Python and MATLAB programs that needed this representation converted into two matrices:
- X – this matrix has the feature vectors
- Y – a column vector containing the labels
The X matrix could have up to a million features, so it needed to be sparse.
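To make the row format concrete, here is a minimal sketch of parsing a single row into a label and a mapping from feature index to value. The function name parse_line and the sample row are made up for illustration, and values are read as floats here, which is an assumption (the script further down uses int).

```python
def parse_line(line):
    """Split 'Label F1:V1 F2:V2 ...' into (label, {feature_index: value})."""
    tokens = line.split()
    label = int(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":")
        features[int(index)] = float(value)  # float is an assumption
    return label, features


# Hypothetical sample row, not taken from the original data set.
label, features = parse_line("1 3:5 10:2 42:7")
```

Only the features that actually appear in a row are stored, which is the same idea the sparse X matrix relies on.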
Convert Text File -> A portable format readable by Python or MATLAB.
This can be accomplished using the Matrix Market I/O format (http://math.nist.gov/MatrixMarket/).
SciPy does have savemat and loadmat functions for MATLAB's .mat format, but the .mtx format is meant to be readable across different languages and does this job quite easily.
The necessary imports
import numpy as np
from scipy import sparse
import scipy.io as sio
Read from the file with the features
f = open("filename.txt")
lines = f.readlines()
f.close()
Initialize our X and Y arrays
X = sparse.lil_matrix((<Number of rows>, <Maximum number of features>))
Y = np.zeros((<Number of rows>, 1))
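The two placeholders have to be filled in with real numbers. If they are not known in advance, one option is a quick first pass over the lines, sketched below. The helper name matrix_shape is hypothetical, and the sketch assumes feature indices are 0-based (so the column count is the largest index plus one) and skips blank lines.

```python
def matrix_shape(lines):
    """Return (number of rows, number of columns) for 'Label i:v ...' rows."""
    n_rows = 0
    max_index = -1
    for line in lines:
        tokens = line.split()
        if not tokens:  # skip blank lines (assumption)
            continue
        n_rows += 1
        for token in tokens[1:]:
            max_index = max(max_index, int(token.split(":")[0]))
    return n_rows, max_index + 1  # assumes 0-based feature indices


# Hypothetical sample rows for illustration.
sample = ["1 0:3 4:1", "0 2:7", "1 9:2 1:5"]
shape = matrix_shape(sample)
```

If the file uses 1-based feature indices, the same column count still works, at the cost of an unused column 0.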
The meat of the script
for i in range(len(lines)):
    line = lines[i].strip()
    # The first token is the label; the rest are feature:value pairs
    label = int(line.split(' ')[0])
    entries = line.split(' ')[1:]
    Y[i, 0] = label
    for entry in entries:
        feature = int(entry.split(':')[0])
        value = int(entry.split(':')[1])
        X[i, feature] = value
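As an alternative to filling a lil_matrix one element at a time, the same data can be collected as (row, column, value) triplets and handed to SciPy in one call. This is a different technique from the loop above, shown only as a sketch; the sample rows are hypothetical and feature indices are assumed to be 0-based.

```python
import numpy as np
from scipy import sparse

# Hypothetical sample rows standing in for the real file contents.
lines = ["1 0:3 4:1", "0 2:7"]

rows, cols, vals, labels = [], [], [], []
for i, line in enumerate(lines):
    tokens = line.split()
    labels.append(int(tokens[0]))
    for token in tokens[1:]:
        index, value = token.split(":")
        rows.append(i)
        cols.append(int(index))
        vals.append(int(value))

n_rows = len(lines)
n_cols = max(cols) + 1  # assumes 0-based feature indices
# Build the matrix from triplets, then convert to CSR for later use.
X = sparse.coo_matrix((vals, (rows, cols)), shape=(n_rows, n_cols)).tocsr()
Y = np.array(labels).reshape(-1, 1)
```

With this approach the matrix shape does not need to be known before parsing, since it can be derived from the collected indices.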
Saving the matrices to a file
Notice that the mmwrite() command has no problem with the X matrix being sparse (a lil_matrix, or row-based linked-list sparse matrix, to be more precise).
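A minimal sketch of the save step with scipy.io.mmwrite follows. The tiny X and Y, the temporary directory, and the name Y.mtx are assumptions for illustration; only X.mtx is named elsewhere in this post.

```python
import os
import tempfile

import numpy as np
import scipy.io as sio
from scipy import sparse

# Tiny stand-in matrices; the real X and Y come from the parsing loop.
X = sparse.lil_matrix((2, 5))
X[0, 0] = 3
X[0, 4] = 1
X[1, 2] = 7
Y = np.array([[1.0], [0.0]])

out_dir = tempfile.mkdtemp()  # hypothetical output location
x_path = os.path.join(out_dir, "X.mtx")
y_path = os.path.join(out_dir, "Y.mtx")

sio.mmwrite(x_path, X)  # sparse input is written in coordinate format
sio.mmwrite(y_path, Y)  # dense input is written in array format

# Read both files back to confirm the round trip.
X_back = sio.mmread(x_path)
Y_back = sio.mmread(y_path)
```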
Reading the mtx file
In Python, this can be done easily using the mmread command:
sio.mmread('X.mtx') #Or any other mtx file
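One detail worth knowing: for a sparse file, mmread hands back the matrix in COO (coordinate) layout, which does not support row slicing. A short sketch of converting it to CSR after loading is below; the small matrix and the temporary path are made up so the example runs on its own.

```python
import os
import tempfile

import scipy.io as sio
from scipy import sparse

# Write a small matrix first so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "X.mtx")
sio.mmwrite(path, sparse.csr_matrix([[3, 0, 0, 0, 1], [0, 0, 7, 0, 0]]))

X_coo = sio.mmread(path)  # sparse files come back in COO layout
X = X_coo.tocsr()         # CSR supports row slicing and fast products
first_row = X[0].toarray().ravel()
```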
For MATLAB, this part is not obvious from a cursory Google search. You need to download the *.m MATLAB files from here: http://math.nist.gov/MatrixMarket/mmio/matlab/mmiomatlab.html
Remember to place them in your current working directory in MATLAB. After that,
[X, rows, cols, entries] = mmread('X.mtx');
A few notes on performance
The file that I used had a matrix of around 4000 rows and 700,000 columns and approximately 25% sparsity. The SciPy mmwrite function generated a 603 MB file.
After reading it into MATLAB, I used the built-in save function. The resulting .mat file had a size of only 40 MB. I guess MATLAB’s file format is highly optimized for sparse arrays.
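Matrix Market is a plain-text format, while .mat files are binary and can be compressed, which is one plausible reason for the gap. A rough way to check this from Python is to write the same matrix both ways and compare file sizes, as sketched below. The random matrix is a small stand-in and the temporary paths are hypothetical; the actual ratio will depend on the data.

```python
import os
import tempfile

import numpy as np
import scipy.io as sio
from scipy import sparse

# Small random stand-in; the real matrix was about 4000 x 700,000.
X = sparse.random(200, 1000, density=0.01, format="csc", random_state=0)

out_dir = tempfile.mkdtemp()
mtx_path = os.path.join(out_dir, "X.mtx")
mat_path = os.path.join(out_dir, "X.mat")

sio.mmwrite(mtx_path, X)                              # plain-text output
sio.savemat(mat_path, {"X": X}, do_compression=True)  # compressed binary

mtx_size = os.path.getsize(mtx_path)
mat_size = os.path.getsize(mat_path)
print(f"Matrix Market: {mtx_size} bytes, compressed .mat: {mat_size} bytes")

# The .mat file can be loaded back in Python as well as in MATLAB.
X_loaded = sio.loadmat(mat_path)["X"]
```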