Scholarly Work
Our Publications
2026
6 papers
TOSEM 2026
Detecting Protracted Vulnerabilities in Open Source Projects
ACM Transactions on Software Engineering and Methodology (TOSEM) 2026
Abstract
Timely resolution and disclosure of vulnerabilities are essential for maintaining the security of open-source software. However, many vulnerabilities remain unreported, unpatched, or undisclosed for extended periods, exposing users to prolonged security risks. We investigate the vulnerability lifecycle by focusing on protracted vulnerabilities (PCVEs), which remain unresolved or undisclosed over long durations. We propose DeeptraVul, an enhanced detection approach tailored to protracted cases, integrating multiple development artifacts and code-level signals supported by a large language model-based summarization component.
TOSEM 2026
"Should I Give Up Now?" Investigating LLM Pitfalls in Software Engineering
ACM Transactions on Software Engineering and Methodology (TOSEM) 2026
Abstract
Software engineers are increasingly incorporating AI assistants into their workflows to enhance productivity and alleviate cognitive load. However, experiences with large language models (LLMs) such as ChatGPT vary widely. Analyzing data from 26 participants in a complex web development task, we identified nine failure types categorized into incorrect or incomplete responses, cognitive overload, and context loss. Our quantitative analysis revealed that unhelpful responses increased the likelihood of abandonment by a factor of 11, while each additional prompt reduced abandonment probability by 17%.
CHI 2026
Untangling the Timeline: Challenges and Opportunities in Supporting Version Control in Modern Computer-Aided Design
The ACM CHI Conference on Human Factors in Computing Systems (CHI) 2026
Abstract
Version control is critical in mechanical CAD to enable traceability, manage product variation, and support collaboration. This paper presents a systematic review of user-reported challenges with version control in modern CAD tools. Analyzing 170 online forum threads, we identify recurring socio-technical issues that span the management, continuity, scope, and distribution of versions. Our findings inform a broader reflection on how version control should be designed and improved for CAD.
CHI 2026
CADModelScope: Revealing the Dependency Structure Behind Parametric Computer-Aided Design Models
The ACM CHI Conference on Human Factors in Computing Systems (CHI) 2026
Abstract
Parametric CAD models are constructed by a sequence of operations, where each operation may reference geometries created by earlier ones. This network of dependencies enables efficient modelling of complex geometry but also results in fragile models where small modifications can trigger cascading errors. We present CADModelScope, a multi-level graph-based visualization of operation dependencies integrated into a commercial CAD platform.
ICSE 2026
Beyond Adoption: Examining the Evolution and Impact of Codes of Conduct on Open-Source Communities
The 48th IEEE/ACM International Conference on Software Engineering (ICSE) 2026
Abstract
While open source software (OSS) communities thrive on collaboration, conflicts such as toxic behavior and discrimination can surface, threatening the sustainability of these projects. To address these concerns, many communities have adopted a Code of Conduct (CoC). Our study compiles a large-scale dataset of CoCs along with their change histories in OSS repositories on GitHub to quantitatively understand the evolution of CoC content and investigate the potential impact of CoC adoption on community engagement. We find that OSS communities with a CoC attract more new contributors and see fewer existing contributors disengage from the community in the long term.
SERS 2026
Do Research Software Engineers and Software Engineering Researchers Speak the Same Language?
1st International Workshop on Software Engineering and Research Software (SERS 2026)
Abstract
Research Software Engineers (RSEs) often use different terminologies than the Software Engineering Research (SER) community for similar concepts. As an outcome of the Dagstuhl Seminar 24161, we developed an approach to explore these terminologies using crowd-sourcing to build a website presenting a "mapping of terms" between the groups.
2025
8 papers
CSCW 2025
It's a Complete Haystack: Understanding Dependency Management Needs in Computer-Aided Design
The 28th ACM Conference on Computer-Supported Cooperative Work and Social Computing (CSCW) 2025
Abstract
Hardware development teams face increasing demands for better quality products, greater innovation, and shorter manufacturing lead times. One significant and unaddressed challenge is understanding and managing dependencies between 3D CAD models, especially when products can contain thousands of interconnected components. In this two-phase formative study, we explore designers' pain points in CAD dependency management through a thematic analysis of 100 online forum discussions and semi-structured interviews with 10 designers. We identify nine key challenges related to the traceability, navigation, and consistency of CAD dependencies.
CSCW 2025
Collaboration Challenges and Opportunities in Developing Scientific Open-Source Software Ecosystem: A Case Study on Astropy
The 28th ACM Conference on Computer-Supported Cooperative Work and Social Computing (CSCW) 2025
Abstract
Scientific open-source software (OSS) has greatly benefited research communities through its transparent and collaborative nature. This study examines the challenges and opportunities for improving collaboration efficiency in the development and maintenance of scientific OSS. We conducted a mixed-methods case study on Astropy, including analysis of commit history, cross-referenced issues and pull requests, and interviews with core contributors.
CSCW 2025
Who is to Blame: A Comprehensive Review of Challenges and Opportunities in Designer-Developer Collaboration
The 28th ACM Conference on Computer-Supported Cooperative Work and Social Computing (CSCW) 2025
Abstract
Software development relies on effective collaboration between Software Development Engineers (SDEs) and User eXperience Designers (UXDs). We conducted a systematic literature review of 45 papers published since 2004, uncovering three key collaboration challenges and two main categories of potential best practices. We then analyzed designer and developer forums and discussions from one open-source software repository to assess how the challenges and practices manifest in the status quo.
CiSE 2025
Do Research Software Engineers and Software Engineering Researchers Speak the Same Language?
Abstract
Anecdotal evidence suggests that Research Software Engineers (RSEs) and Software Engineering Researchers (SERs) often use different terminologies for similar concepts, creating communication challenges. Our preliminary findings reveal opportunities for mutual learning and collaboration, and our systematic methodology for terminology mapping provides a foundation for crowd-sourced extension and validation.
WWW 2025
MAML: Towards a Faster Web in Developing Regions
Abstract
The web experience in developing regions remains subpar, primarily due to the growing complexity of modern webpages. We introduce the Mobile Application Markup Language (MAML), a flat layout-based web specification language that reduces computational and data transmission demands, while replacing excessive bloat from JS with a new scripting language centered on essential web functionalities. When compared to Google AMP across 100 test webpages, MAML offers speedups of tens of seconds under challenging network conditions.
CHASE 2025
Advancing Sustainable Communities in Scientific OSS: A Replication Study with Astropy
Abstract
Scientific OSS fosters transparency and collaboration. Through a survey-based replication study in the Astropy Project, we gathered insights from disengaged contributors regarding their motivations, reasons for disengagement, and suggestions for improving community sustainability. Our findings reveal key motivations driving scientific contributions to OSS and identify barriers to sustained engagement.
MOBILESoft 2025
LLMs in Mobile Apps: Practices, Challenges, and Opportunities
Abstract
We constructed a comprehensive dataset of 149 LLM-enabled Android apps and conducted an exploratory analysis to understand how LLMs are deployed and used within mobile apps. This analysis highlights key characteristics of the dataset, prevalent integration strategies, and common challenges developers face when integrating LLMs, spanning mobile device constraints, API management, and code infrastructure.
ICSE 2025
The Product Beyond the Model — An Empirical Study of Repositories of Open-Source ML Products
Abstract
We contribute a dataset of 262 open-source ML products for end users identified among more than half a million ML-related projects on GitHub. We qualitatively and quantitatively analyze 30 open-source ML products to answer six broad research questions about development practices and system architecture, reporting 21 findings including limited involvement of data scientists and unusually low modularity between ML and non-ML code.
2024
3 papers
ICSME 2024
Can We Do Better with What We Have Done? Unveiling the Potential of ML Pipeline in Notebooks
Abstract
Computational notebooks are widely adopted by data scientists for experimenting with machine learning models. We conduct a qualitative analysis to examine how data scientists explore various alternatives through a series of versions of notebooks on Kaggle. By combining alternatives from all stages to form previously unexplored paths, we discover that certain untested combinations can outperform the best models as identified in the original notebooks.
CSCW 2024
"A Lot of Moving Parts": A Case Study of Open-Source Hardware Design Collaboration in the Thingiverse Community
Abstract
We conduct a detailed case study of DrawBot, a successful open-source hardware project that fostered a remarkably long-term collaboration on Thingiverse, a platform not explicitly intended for complex collaborative design. By analyzing comment threads and design changes, we show how collaboration occurred, what challenges arose, and how the DrawBot community overcame them.
2023
6 papers
ICSME 2023
Aligning Documentation and Q&A Forum through Constrained Decoding with Weak Supervision
Abstract
Stack Overflow plays a supplementary role to official documentation by offering practical examples and resolving uncertainties. We propose DOSA, a novel approach to automatically align Stack Overflow and documentation, injecting domain-specific knowledge about the documentation structure into large language models through weak supervision and constrained decoding. Our preliminary experiments find that DOSA outperforms various widely-used baselines.
CSCW 2023
User Perspectives on Branching in Computer-Aided Design
Abstract
We mine and analyze 719 user-generated posts from online CAD forums to qualitatively study designers' intentions for and preliminary use of branching in CAD. Our work contributes a taxonomy of CAD branching use cases, an identification of deficiencies of existing branching capabilities in CAD, and a discussion of the untapped potential of CAD branching to support a new paradigm of collaborative mechanical design.
CSCW 2023
In the Age of Collaboration, the Computer-Aided Design Ecosystem is Behind: Evidence from an Interview Study of Distributed CAD Practice
Abstract
We conduct semi-structured interviews with 20 CAD professionals of diverse industries, roles, and experience levels to understand their collaborative workflows with distributed CAD tools. In total, we identify 14 challenges related to collaborative design, communication, data management, and permissioning that are currently impeding effective collaboration in professional CAD teams.
CHI 2023
Interaction of Thoughts: Towards Mediating Task Assignment in Human-AI Cooperation with a Capability-Aware Shared Mental Model
Abstract
We propose a capability-aware shared mental model (CASMM) for task assignment in human-AI cooperation, utilizing tuples to break down tasks into sets of scenarios and dynamically merging task grouping ideas through negotiation. A 3-phase user study via an image labeling task shows that building CASMM significantly boosts accuracy and time efficiency by forming task assignments close to real capabilities within a few iterations.
CHI 2023
Aspirations and Practice of ML Model Documentation: Moving the Needle with Nudging and Traceability
Abstract
Our analysis of publicly available model cards reveals a substantial gap between the model cards proposal and practice. We design a tool named DocML that aims to nudge data scientists to comply with the model cards proposal during model development and to assess and manage documentation quality. A lab study reveals the benefit of our tool for long-term documentation quality and accountability.
CAIN 2023
★ Best Paper
A Meta-Summary of Challenges in Building Products with ML Components — Collecting Experiences from 4758+ Practitioners
Abstract
Incorporating machine learning components into software products raises new software-engineering challenges and exacerbates existing ones. We provide a meta-summary synthesizing findings from studies involving 4758+ practitioners, identifying recurring challenges and providing a consolidated view of the landscape of ML engineering challenges in industry practice.
2022
4 papers
CASCON 2022
Exploring Trends and Practices of Forks in Open-Source Software Repositories
32nd Annual International Conference on Computer Science and Software Engineering (CASCON) 2022
Abstract
Forking a software repository is a popular and recommended practice among developers. A fork is a copy of the original repository that can evolve independently from the parent repository, allowing developers to experiment with a code base or test new features without the danger of affecting the original project. In this work, we explore the motivation, the practices and the culture of forking open-source software repositories, studying how forks evolve compared to the parent repository, how they are related to pull requests, how they contribute back to the parent, and how dependencies are shared or differ within project families.
ICSME 2022 – NIER
Elevating Jupyter Notebook Maintenance Tooling by Identifying and Extracting Notebook Structures
International Conference on Software Maintenance and Evolution (ICSME) 2022 — New Ideas and Emerging Results Track (NIER)
Abstract
Computational notebooks have become a popular tool for data analysis, but notebooks in practice are often criticized as hard to maintain and of low code quality. We argue that identifying the structure of notebooks is central to better tool support. We present a lightweight and accurate approach to extract notebook structure and outline several ways such structure can be used to improve maintenance tooling for notebooks, including navigation and finding alternatives.
IST 2022
An Empirical Study of Emoji Use in Software Development Communication
Information and Software Technology (IST) 2022
Abstract
We present a large-scale empirical study on the intention of emoji usage conducted on 2,712 Open Source Software projects. We build a machine learning model to automate classifying the intentions behind emoji usage in 39,980 posts. Our results show that we can classify the intention of emoji usage with high accuracy (AUC of 0.97), and that developers use emoji for varying intentions that change throughout a conversation.
ICSE 2022
Collaboration Challenges in Building ML-Enabled Software: Communication, Documentation, Engineering, and Process
44th International Conference on Software Engineering (ICSE) 2022
Abstract
Building ML-enabled software involves collaboration between team members with different backgrounds and expertise. We conducted an interview study to understand collaboration challenges in building ML-enabled software, identifying challenges around communication, documentation, engineering, and process.
2021
5 papers
FSE 2021
Studying the Effect of Pull Request Revert on Software Quality
ACM SIGSOFT International Symposium on the Foundations of Software Engineering (FSE) 2021
Abstract
Pull requests are a central mechanism for code integration in modern collaborative software development. This study examines the effects of reverted pull requests on software quality, analyzing large-scale repository data to understand when and why pull requests are reverted and what impact this has on the codebase.
RAISE 2021
Splitting, Renaming, Removing: A Study of Common Cleaning Activities in Jupyter Notebooks
8th International Workshop on Realizing Artificial Intelligence Synergies in Software Engineering (RAISE) 2021
Abstract
Data scientists commonly use computational notebooks because they provide a good environment for testing multiple models. In this paper, we perform a qualitative study on how data scientists clean their code. By sampling notebooks from GitHub and analyzing changes between subsequent commits, we identified common cleaning activities, such as changes to markdown or comments as well as reordering cells. Our results provide a valuable foundation for tool builders and notebook users.
JSME 2021
Perceptions of Open-Source Software Developers on Collaborations: An Interview and Survey Study
Journal of Software: Evolution and Process (JSME) 2021
Abstract
We investigate the perceptions of open-source software developers on collaborations, such as motivations, techniques, and tools to support global, productive, and collaborative development. Following an interview study with 12 open-source software developers from GitHub, we conducted an extensive survey with 121 developers. We found that most collaborators prefer to collaborate with the core team, and most collaboration happens in software development and maintenance tasks.
ASE 2021
Subtle Bugs Everywhere: Generating Documentation for Data Wrangling Code
36th IEEE/ACM International Conference on Automated Software Engineering (ASE) 2021
Abstract
Data scientists reportedly spend a significant amount of their time on data wrangling. We present a technique to generate interactive documentation for data wrangling code using program synthesis techniques to automatically summarize data transformations and test case selection techniques to purposefully select representative examples. A user study shows that users with our JupyterLab plugin are faster and more effective at finding realistic bugs in data wrangling code.
🏆 Distinguished Paper Award
ICSME 2021
Interactive Patch Filtering as Debugging Aid
37th International Conference on Software Maintenance and Evolution (ICSME) 2021
Abstract
We propose an interactive patch filtering approach that helps developers in the patch review process by effectively filtering out groups of incorrect patches. We implemented the approach as an Eclipse plugin, InPaFer, and evaluated its effectiveness. The results show that our approach improves the repair performance of developers, with 62.5% more successfully repaired bugs and 25.3% less debugging time.
≤2020
12 papers
ICSE 2020
How Has Forking Changed in the Last 20 Years? A Study of Hard Forks on GitHub
42nd International Conference on Software Engineering (ICSE) 2020 — Acceptance rate: 20.9% (129/617)
Abstract
The notion of forking has changed with the rise of distributed version control systems and social coding environments like GitHub. To revisit hard forks, we identify, study, and classify 15,306 hard forks on GitHub and interview 18 owners of hard forks or forked repositories. We find that hard forks often evolve out of social forks rather than being planned deliberately and that perceptions about hard forks have changed dramatically, seeing them often as a positive noncompetitive alternative to the original project.
ICGSE 2020
Understanding Collaborative Software Development: An Interview Study
15th ACM/IEEE International Conference on Global Software Engineering (ICGSE) 2020
Abstract
This paper presents an interview study aiming to understand the motivations for collaborative software development, how collaboration happens, and the challenges and barriers involved. After interviewing twelve experienced software developers from GitHub, we found different types of collaborative contributions. Our analysis indicates that the main barriers to collaboration are non-technical rather than technical.
MSR 2020 – Mining Challenge
An Exploratory Study to Find Motives behind Cross-platform Forks from Software Heritage Dataset
17th International Conference on Mining Software Repositories (MSR) 2020 — Mining Challenge Track
Abstract
With the advent of the Software Heritage Graph Dataset, we have the opportunity to investigate forking activities across platforms. We conduct an exploratory study on 10 popular open-source projects to identify cross-platform forks and investigate the motivations behind them. We found that most cross-platform forks are mirrors of repositories on another platform, but we also find cases created due to a preference for certain functionalities supported by different platforms.
FSE 2019
What the Fork: A Study of Inefficient and Efficient Forking Practices in Social Coding
27th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE) 2019 — Acceptance rate: 24% (74/303)
Abstract
Forking and pull requests have been widely used in open-source communities as uniform development and contribution mechanisms. However, some projects observe severe inefficiencies, including lost and duplicate contributions and fragmented communities. Using logistic regression models, we analyzed the association of context factors with inefficiencies and found that better modularity and centralized management can encourage more contributions and a higher fraction of accepted pull requests.
SANER 2019
Identifying Redundancies in Fork-based Development
26th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER) 2019 — Acceptance rate: 27% (40/148)
Abstract
Fork-based development makes it difficult to maintain an overview of the whole community when the number of forks increases, which may lead to redundant development. We designed an approach to identify redundant code changes in forks as early as possible by building a machine learning model to predict redundancies. The results show 57-83% precision for detecting duplicate code changes, and the approach could save developers 1.9-3.0 commits of effort on average.
ISSRE 2019
How to Explain a Patch: An Empirical Study of Patch Explanations in Open Source Projects
30th International Symposium on Software Reliability Engineering (ISSRE) 2019
Abstract
We explored how developers explain their patches by manually analyzing 300 merged bug-fixing pull requests from six projects on GitHub. We built a patch explanation model which summarizes the elements in a patch explanation and their corresponding expressive forms. We also conducted a quantitative analysis to understand the distributions of elements and the correlation between elements and their expressive forms.
ASE 2019 – Doctoral Symposium
Improving Collaboration Efficiency in Fork-based Development
Companion of the International Conference on Automated Software Engineering (ASE) 2019
ICSE 2018 – Poster
Poster: Forks Insight: Providing an Overview of GitHub Forks
Companion of the International Conference on Software Engineering (ICSE) 2018 — Poster
ICSE 2018
Identifying Features in Forks
40th International Conference on Software Engineering (ICSE) 2018 — Acceptance rate: 21%
Abstract
We introduce INFOX, an approach to automatically identify unmerged features in forks and generate an overview of active forks in a project. The approach clusters cohesive code fragments using code and network analysis techniques and uses information-retrieval techniques to label clusters with keywords. The clustering is effective, with 90% accuracy on a set of known features, and a human-subject evaluation shows that INFOX can provide actionable insight for developers of forks.
ICSE 2018
Adding Sparkle to Social Coding: An Empirical Study of Repository Badges in the npm Ecosystem
40th International Conference on Software Engineering (ICSE) 2018 — Acceptance rate: 21%
Abstract
We report on a large-scale, mixed-methods empirical study of npm packages exploring the emerging phenomenon of repository badges. After surveying developers, mining 294,941 repositories, and applying statistical modeling and time series analysis, we find that non-trivial badges are mostly reliable signals, correlating with more tests, better pull requests, and fresher dependencies.
Releng 2015
Extracting Configuration Knowledge from Build Files with Symbolic Analysis
Abstract
Build systems contain a lot of configuration knowledge about a software system, such as the conditions under which specific files are compiled. We design an approach, based on SYMake, that symbolically evaluates Makefiles and extracts configuration knowledge in terms of file presence conditions and conditional parameters.
Internetware 2013
Elastic Resource Management for Heterogeneous Applications on PaaS
5th Asia-Pacific Symposium on Internetware 2013 — ACM, New York, NY
Abstract
We propose a practical and effective elasticity approach based on the analysis of application features: CPU consumption, I/O consumption, and request rate. The evaluation experiment shows that, compared with traditional approaches, our approach can save up to 32.8% of VMs without a significant increase in average response time or SLA violations.