Research Overview

My research focuses on a central question: How do people behave and misbehave in software ecosystems? I study open-source ecosystems such as GitHub, where millions of developers create, reuse, and share code. These ecosystems reflect how people collaborate, innovate, and sometimes exploit shared software. My work integrates software engineering, cybersecurity, and data science to understand these patterns and to design tools that make open-source development safer, more transparent, and easier to explore.


Research Projects

Tracing Technogeek Identities (GeekMAN)

Many developers, particularly in hacking and gaming communities, use creative technogeek usernames such as z3r0c001 or B14CKH4K3R. These stylized identities make it difficult to link the same person across different platforms.

We developed GeekMAN, a systematic approach that links technogeek usernames across forums, GitHub, and social platforms. The system translates leetspeak into readable text, splits complex handles into meaningful parts, and compares them using semantic similarity measures. GeekMAN achieved up to 86% precision on technogeek datasets, improving cross-platform linkage by 10 to 20 percentage points over previous methods. GeekMAN is publicly available as a research tool.
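
The three steps described above (leetspeak translation, handle splitting, comparison) can be illustrated with a minimal Python sketch. This is not GeekMAN's implementation: the character table LEET_MAP, the function names, and the splitting rules are hypothetical, and the character-level ratio from Python's difflib stands in for the semantic similarity measures that GeekMAN actually uses.

```python
import re
from difflib import SequenceMatcher

# Hypothetical leetspeak table for illustration; GeekMAN's real mapping is
# not specified in the text above.
LEET_MAP = str.maketrans({
    "0": "o", "1": "l", "3": "e", "4": "a",
    "5": "s", "7": "t", "@": "a", "$": "s",
})


def deleet(text: str) -> str:
    """Lowercase a handle and replace common leetspeak characters."""
    return text.lower().translate(LEET_MAP)


def split_handle(handle: str) -> list[str]:
    """Split on separators and camelCase boundaries, then de-leet each part."""
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", handle)
    return [deleet(part) for part in re.split(r"[\s_.\-]+", spaced) if part]


def handle_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for semantic similarity."""
    return SequenceMatcher(
        None, "".join(split_handle(a)), "".join(split_handle(b))
    ).ratio()


if __name__ == "__main__":
    print(deleet("z3r0c001"))                         # zerocool
    print(split_handle("Zero_Cool"))                  # ['zero', 'cool']
    print(handle_similarity("z3r0c001", "ZeroCool"))  # 1.0
```

A character-level ratio only catches handles that normalize to nearly the same string. Linking handles that are related in meaning but spelled differently would require a semantic measure, which this sketch does not attempt.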

Searching and Understanding Code Ecosystems (MetaSim and RepoScope)

Developers frequently search GitHub to learn or reuse code, but existing search tools remain largely keyword-based and opaque. The key question that we address is: Given one GitHub repository, how can we find others that are similar in purpose and functionality?

To answer this, we created MetaSim and RepoScope, two systems for exploring code ecosystems. MetaSim studies how metadata such as repository descriptions, topics, and README files define functional similarity, showing that combining these signals yields a more accurate and interpretable view of project relationships. RepoScope extends this work to support "search by example repository". It supports exploration at multiple levels, from titles and metadata down to source-code embeddings, and visualizes clusters of related projects so that results are easier to interpret.
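
As a rough illustration of combining metadata signals, the Python sketch below scores two repositories on description, topics, and README separately and then takes a weighted sum. It is only a sketch under stated assumptions: the Repo class, the WEIGHTS values, and the choice of bag-of-words cosine and Jaccard overlap are hypothetical and are not taken from MetaSim or RepoScope. Keeping the per-signal scores visible is one simple way to make a ranking interpretable.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Repo:
    """Hypothetical container for the metadata signals named in the text."""
    name: str
    description: str = ""
    topics: frozenset = frozenset()
    readme: str = ""


# Hypothetical weights, chosen only for illustration.
WEIGHTS = {"description": 0.4, "topics": 0.3, "readme": 0.3}


def bag_of_words(text: str) -> Counter:
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def jaccard(a: frozenset, b: frozenset) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def signal_scores(x: Repo, y: Repo) -> dict:
    """Per-signal similarity, kept separate so a match can be explained."""
    return {
        "description": cosine(bag_of_words(x.description), bag_of_words(y.description)),
        "topics": jaccard(x.topics, y.topics),
        "readme": cosine(bag_of_words(x.readme), bag_of_words(y.readme)),
    }


def combined_similarity(x: Repo, y: Repo) -> float:
    scores = signal_scores(x, y)
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def rank_similar(query: Repo, candidates: list) -> list:
    """Return (name, score) pairs, most similar first."""
    scored = [(c.name, combined_similarity(query, c)) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Calling rank_similar(query, candidates) with one example repository returns the candidates ordered by combined score, which is the basic shape of a "search by example repository" query. A real system would likely replace the bag-of-words vectors with learned embeddings.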

Ongoing Work: Understanding Evolving Threats (Malware and Vulnerabilities)

A growing part of my research focuses on how harmful or vulnerable code spreads within open-source ecosystems. We are pursuing two main projects in this area:

  • Malware Ecosystem on GitHub: We are studying the thousands of repositories on GitHub that host malicious or dual-use code. We aim to uncover the technical and social mechanisms that shape how these repositories are created, forked, and maintained, and how they evolve over time.
  • Orphy (Orphan Vulnerabilities): We are examining how vulnerabilities propagate across forks and clones, often remaining "orphaned" when patches fail to reach derivative projects. We analyze how trust, collaboration, and communication influence the adoption of fixes, combining social-network and software-artifact data to understand why orphan vulnerabilities persist.
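
To make the notion of an orphan vulnerability concrete, the Python sketch below flags forks that contain a vulnerable commit but not the upstream fix. It is a simplified illustration, not the Orphy method: the Fork class, the function name, and the commit identifiers are hypothetical, and matching on commit hashes is an assumption that misses fixes that were cherry-picked or re-applied under a different hash.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fork:
    """Hypothetical record of a derivative project and its commit history."""
    name: str
    commits: frozenset  # commit hashes reachable in the fork


def find_orphans(forks, vulnerable_commit: str, fix_commit: str) -> list:
    """Names of forks that inherited the vulnerability but never got the fix.

    Assumption: a fix counts as adopted only if the exact upstream fix
    commit is present in the fork.
    """
    return [
        fork.name
        for fork in forks
        if vulnerable_commit in fork.commits and fix_commit not in fork.commits
    ]


if __name__ == "__main__":
    forks = [
        Fork("alice/lib", frozenset({"c1", "bad", "fix"})),  # patched
        Fork("bob/lib", frozenset({"c1", "bad"})),           # orphaned
        Fork("carol/lib", frozenset({"c1"})),                # forked before the bug
    ]
    print(find_orphans(forks, vulnerable_commit="bad", fix_commit="fix"))
    # ['bob/lib']
```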