An Information Retrieval project on learning author vector representations from the collaboration of authors on research papers.


The aim of the project is to learn good vector representations for authors who publish scientific research. These representations should place authors who work in the same domain (i.e. the same research area) close together in vector space. They help to categorize or cluster authors and to predict future collaborations from past data. The idea is to build a model that learns author representations such that authors who write similar content and share a similar network structure are close in vector space.


The DBLP computer science bibliography contains metadata for publications written by many authors across thousands of journals and conference proceedings series. We use a subset of this dataset with metadata for around 275,000 papers.


From the dataset, a co-authorship network is formed, represented as a graph in which each vertex is an author and each edge represents a collaboration between two authors. Given the papers in the dataset and an embedding size, our goal is to learn a vector representation for each author. The model takes the (randomly initialised) vector embeddings of two authors as input. The training tuples consist of positive pairs (authors who have collaborated) and negative pairs (authors who have never collaborated in the training set). This setup pulls authors who share a similar network structure closer together in vector space and pushes them away from irrelevant authors.
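The pair-construction step above can be sketched as follows. This is an illustrative helper, not the repo's actual code; the function name `build_pairs` and the data shapes are assumptions.

```python
import random
from collections import defaultdict

def build_pairs(papers, num_negative=5, seed=0):
    """papers: list of author-id lists, one list per paper.

    Returns positive pairs (co-authors) and, for each author,
    num_negative negative pairs (authors never seen together).
    """
    rng = random.Random(seed)
    coauthors = defaultdict(set)          # author -> set of collaborators
    for author_list in papers:
        for a in author_list:
            for b in author_list:
                if a != b:
                    coauthors[a].add(b)
    authors = sorted(coauthors)
    positives, negatives = [], []
    for a in authors:
        positives.extend((a, b) for b in coauthors[a])
        # negative candidates: authors who never co-occur with a
        candidates = [x for x in authors if x != a and x not in coauthors[a]]
        if candidates:
            negatives.extend((a, rng.choice(candidates))
                             for _ in range(num_negative))
    return positives, negatives
```

Sampling a fixed number of negatives per author keeps the training set balanced even though non-collaborations vastly outnumber collaborations.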


Techniques that learn to transform raw input data into vector representations usable in downstream machine learning tasks are called representation learning (or feature learning) techniques. They have achieved great success in applications such as image processing, speech recognition and natural language processing (NLP). The following steps are performed.

Text Processing

  1. The dataset file is parsed to find the unique authors, and each is assigned an id.

  2. The co-authorship information is then extracted, i.e. a list of the authors who have collaborated with each other is built.

  3. Each author is assigned a label (the topic of his/her publications) by finding the topic in which the author has published the most papers. In case of a tie, one of the tied topics is chosen at random.
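Step 3 above can be sketched as follows. The function name `assign_labels` and the input shape (a dict mapping author id to the topics of that author's papers) are assumptions for illustration, not the repo's actual parser.

```python
import random
from collections import Counter

def assign_labels(author_topics, seed=0):
    """author_topics: dict author_id -> list of topics of their papers."""
    rng = random.Random(seed)
    labels = {}
    for author, topics in author_topics.items():
        counts = Counter(topics)
        best = max(counts.values())
        # all topics tied at the maximum publication count
        tied = [t for t, c in counts.items() if c == best]
        labels[author] = rng.choice(tied)   # random tie-break, as in step 3
    return labels
```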

Training Neural Network

The input to the neural network is the refined co-authorship file, which contains, for every author, the authors in positive and negative context. Here positive context means authors who have collaborated with each other, and negative context means those who have not. The open-source tool Torch is used to train the neural network. The network is fed the positive and negative samples and iterated for 10 epochs over the authors in the dataset, and a vector representation for each author is learned. The representations are trained so that authors in positive context end up closer in vector space.
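The training objective can be sketched in NumPy as below. This is an illustrative re-implementation of the idea, not NN.lua itself (the actual model is trained with Torch in Lua): each update nudges a positive pair's embeddings together and a negative pair's apart via a logistic loss on their dot product.

```python
import numpy as np

def train_embeddings(positives, negatives, num_authors, dim=8,
                     epochs=10, lr=0.05, seed=0):
    """positives/negatives: lists of (author_a, author_b) id pairs."""
    rng = np.random.default_rng(seed)
    emb = rng.normal(scale=0.1, size=(num_authors, dim))  # random init
    for _ in range(epochs):
        for pairs, label in ((positives, 1.0), (negatives, 0.0)):
            for a, b in pairs:
                # sigmoid of the dot product = predicted collaboration prob.
                score = 1.0 / (1.0 + np.exp(-emb[a] @ emb[b]))
                grad = score - label          # logistic-loss gradient
                ga, gb = grad * emb[b], grad * emb[a]
                emb[a] -= lr * ga
                emb[b] -= lr * gb
    return emb
```

After training, a positive pair's dot product should exceed that of a negative pair involving the same author, which is exactly the "closer in vector space" property the section describes.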

Classification of vector representations

Various algorithms are used to classify the vectors into groups and are evaluated against a test corpus.


Random forest gives a mean accuracy of 28 percent and SVM gives a mean accuracy of 30 percent (the parameters C and gamma are tuned by grid search; C = 0.001 and gamma = 10).
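The evaluation step can be sketched with scikit-learn as below. This is an assumed re-sketch of what randomforest.py and svm.py do, not the scripts themselves; the grid values and cross-validation settings are illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

def evaluate(vectors, labels):
    """vectors: author embeddings; labels: each author's topic label."""
    # random forest: mean cross-validated accuracy
    rf_acc = cross_val_score(RandomForestClassifier(n_estimators=100),
                             vectors, labels, cv=3).mean()
    # SVM with C and gamma tuned by grid search, as described above
    grid = GridSearchCV(SVC(),
                        {"C": [1e-3, 1e-1, 1, 10],
                         "gamma": [1e-3, 1e-1, 1, 10]},
                        cv=3)
    grid.fit(vectors, labels)
    return rf_acc, grid.best_score_, grid.best_params_
```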


generateid.py - Generates a unique id for each author in the dataset.

tensor.py - Generates authors_final.txt, in which every line corresponds to an author followed by the authors in that author's positive and negative context.

NN.lua - Neural network code that trains the author vector representations.


sgd.py - Performs Stochastic Gradient Descent classification.

randomforest.py - Performs Random Forest classification.

svm.py - Performs Support Vector Machine classification with grid search.

How to run?

$ python classifier_file auth_vector auth_highfreqlabel

Input files

auth_vector - Contains the vector learned by the neural network for each author.

auth_highfreqlabel - Each line corresponds to an author and the most frequently occurring label for that author.