
Introduction and Overview
Throughout development, maintenance and evolution of a software system, developers are faced with continual learning. They may be onboarding to a new software project and team, learning how to use a new API or new part of an API, determining how to implement a specific feature, identifying which language feature or design pattern is most appropriate for a particular task, or learning which tool to use and how to use that tool to work more efficiently. Both novice and experienced developers find themselves in these situations as applications, libraries, languages, tools and techniques evolve.
Problems

Relevant information is spread across different sources and the tools that record it, e.g., versioning systems, issue trackers, question-answer forums, emails, documentation resources, and the code itself.
Developers must query these systems separately and aggregate the results manually. This relies on knowing what to query and how, and which resources give the best results, and on iterating the process, which makes learning costly and tedious.

Solution
Automated assistance to ease the learning process for software developers in the context of their current maintenance or evolution task.
How?
By extracting and exploiting valuable information from programmers’ natural language (NL) usage in source code (identifiers, method signatures, and comments) [33, 37, 38, 46, 78, 94, 95, 111], and by building on state-of-the-art techniques in traceability [1, 4, 22, 23, 25, 65].
Challenges

to enable a granularity of textual analysis of source code that provides more accurate semantic information, bridging the wide gap between single word and method level analysis,
to automatically extract and distinguish between different kinds of facts, advice on good practices and pitfalls, and examples (even code templates) to learn from across a wide set of different available software artifacts containing natural language and code snippets (mixed text-code documents), and
to determine adequate information about the developers’ current context to use in place of, or along with a user query, for relating the mined learning nuggets to the current developer context.

Current Traceability
We will contribute to the state of the art by developing analyses:

(1) to automatically extract, describe, and generalize information from source code at the level of granularity between single words and whole methods, that is, at the multi-statement, algorithm-step level,
(2) to automatically identify, extract and distinguish between different kinds of information such as facts, (positive and negative) opinions/advice, and usage information in mixed text-code artifacts such as emails, question-answer forums, and other developer communications, and
(3) to automatically identify the relevant context of the developer and relate it to the appropriate learning nuggets.

We will perform well-designed studies to evaluate the effectiveness of the proposed research.

Action Unit Granularity for Text Analysis of Source Code
The first goal of this task is to automatically identify action units, that is, multi-statement sequences that each implement one algorithm step. Action units provide a level of semantic meaning between individual statements and whole methods that can be leveraged in supporting learning in context.
The action unit also appears to be at the same granularity as the examples that developers write for API usage as part of documentation [16]. Action units provide a fine granularity of context surrounding an API call, and serve as natural candidates for code examples and for generating code templates. In addition, identifying and generating descriptions for action units within a method could improve the accuracy of text analysis of source code. For example, an action unit might perform an “update” or a “check” on some variable even though neither word appears in the code. Relying only on matching words in the source code would not attain the same accuracy.
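To make the granularity concrete, here is a minimal sketch of grouping consecutive statements into candidate action units. It assumes a simple def-use heuristic (a statement joins the current unit if it reads or writes a variable that the unit has already written). The function names are hypothetical, the input is Python code for convenience, and this is not the identification technique proposed here. It requires Python 3.9 or later for ast.unparse.

```python
import ast

def names(node, ctx_type):
    """Collect variable names inside `node` that appear in the given context."""
    return {n.id for n in ast.walk(node)
            if isinstance(n, ast.Name) and isinstance(n.ctx, ctx_type)}

def candidate_action_units(source):
    """Group consecutive top-level statements linked by a def-use chain."""
    units, current, defined = [], [], set()
    for stmt in ast.parse(source).body:
        reads, writes = names(stmt, ast.Load), names(stmt, ast.Store)
        # Start a new unit when the statement touches nothing the unit wrote.
        if current and not (reads | writes) & defined:
            units.append(current)
            current, defined = [], set()
        current.append(ast.unparse(stmt))
        defined |= writes
    if current:
        units.append(current)
    return units

snippet = """
total = 0
for x in items:
    total += x
found = False
for y in items:
    if y == key:
        found = True
"""
units = candidate_action_units(snippet)
for unit in units:
    print(unit)
```

On this snippet the heuristic yields two units of two statements each. The second one performs a "check" on whether key occurs in items, although the word "check" never appears in the code.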

Identification of Action Units.
Data flow chain
Extended SWIFT
Action Unit Similarity, Clustering, and Example and Code Template Generation

Refine the approach to identify action units based on more evaluation and data analysis.
Develop an algorithm to detect similarity between action units and cluster similar units.
Develop a technique to generalize from action unit clusters to generate code templates.
Evaluate effectiveness of action units for intended uses in developer learning in context.
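The similarity, clustering, and template steps above can be sketched as follows. This is an illustration under stated assumptions, not the algorithm to be developed: identifiers and literals are abstracted to ID/NUM, similarity is the standard-library difflib sequence ratio, clustering is a greedy single pass, and the template step only handles cluster members with equal token counts. All function names, the keyword list, and the 0.8 threshold are hypothetical.

```python
import re
from difflib import SequenceMatcher

KEYWORDS = {"for", "in", "if", "is", "while", "return", "None", "True", "False"}

def tokens(code):
    """Split code into identifiers, integer literals, and runs of punctuation."""
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\w\s]+", code)

def shape(code):
    """Replace identifiers and literals with ID/NUM so renamed copies compare equal."""
    out = []
    for tok in tokens(code):
        if tok in KEYWORDS or not (tok[0].isalnum() or tok[0] == "_"):
            out.append(tok)
        elif tok[0].isdigit():
            out.append("NUM")
        else:
            out.append("ID")
    return out

def similarity(a, b):
    """Similarity in [0, 1] between the abstracted token sequences of two units."""
    return SequenceMatcher(None, shape(a), shape(b)).ratio()

def cluster(units, threshold=0.8):
    """Greedy single pass: join the first cluster whose first member is similar enough."""
    clusters = []
    for unit in units:
        for group in clusters:
            if similarity(unit, group[0]) >= threshold:
                group.append(unit)
                break
        else:
            clusters.append([unit])
    return clusters

def template(group):
    """Keep tokens shared by every member; number the positions where members differ."""
    rows = [tokens(u) for u in group]
    if len({len(r) for r in rows}) != 1:
        return None  # this sketch only aligns members with equal token counts
    holes, out = {}, []
    for column in zip(*rows):
        if len(set(column)) == 1:
            out.append(column[0])
        else:
            out.append(holes.setdefault(column, f"${len(holes) + 1}"))
    return " ".join(out)

u1 = "total = 0\nfor x in items: total += x"
u2 = "count = 0\nfor n in values: count += n"
u3 = "if user is None: return False"
clusters = cluster([u1, u3, u2])
print(template(clusters[0]))
```

Reusing the same placeholder for every position where the members differ in the same way keeps the template consistent: the accumulator variable receives one placeholder at both of its occurrences.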

Identifying/Characterizing Facts & Advice from Mixed Text-Code Artifacts

Perform manual analysis of different kinds of mixed text-code artifacts to learn clues for facts of different kinds and positive and negative polarity advice.
Develop a corpus of text and annotate with category information about facts and advice.
Based on the learned clues or features, develop a rule-based or machine-learning system that automatically recognizes the clues to identify different kinds of facts and advice in mixed text-code artifacts.
Build a system based on the approach, evaluate, and refine the approach.
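As an illustration of the rule-based option, the sketch below labels each line of a mixed text-code post as code, negative advice, positive advice, fact, or other. The cue lists and the code-detection heuristic are invented for this example. In the plan above, the real clues would come from the manual analysis and the annotated corpus.

```python
import re

# Hypothetical cue patterns; real ones would be learned from annotated data.
CODE = re.compile(r"[;{}]\s*$|^\s{4,}\S")
NEGATIVE = re.compile(
    r"\b(avoid|never|don't|do not|should not|shouldn't|deprecated)\b", re.I)
POSITIVE = re.compile(r"\b(should|recommend|prefer|best practice|always)\b", re.I)
FACT = re.compile(r"\b(is|are|returns?|throws?)\b", re.I)

def classify(line):
    """Return a coarse label for one line of a mixed text-code artifact."""
    if CODE.search(line):
        return "code"
    if NEGATIVE.search(line):  # checked first so "should not" beats "should"
        return "negative advice"
    if POSITIVE.search(line):
        return "positive advice"
    if FACT.search(line):
        return "fact"
    return "other"

post = [
    "HashMap is not synchronized.",
    "You should use ConcurrentHashMap for shared state.",
    "Never iterate and modify the map at the same time.",
    "map.put(key, value);",
    "Thanks!",
]
labels = [classify(line) for line in post]
for line, label in zip(post, labels):
    print(f"{label:16} {line}")
```

The rule order matters: code is filtered out before any text cue is applied, and negative cues are tested before positive ones.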

Context-based Learning Nugget Analysis

Develop techniques to associate our mined learning nuggets with corresponding software components.
Implement, evaluate, and refine the current techniques for representing and using developer context to extract the learning nuggets for the given context, using the established associations.
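One simple way to picture the two steps above: each nugget carries the set of software components it was associated with, the developer context is the set of component names in the code being edited, and nuggets are ranked by the overlap between the two. The Nugget class, the Jaccard score, and the sample data are assumptions made for this sketch, not the representation to be developed.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Nugget:
    text: str
    components: frozenset  # names of the software components the nugget is about

def rank_nuggets(context, nuggets, top_k=3):
    """Rank nuggets by Jaccard overlap between their components and the context."""
    scored = []
    for nugget in nuggets:
        shared = context & nugget.components
        if shared:  # drop nuggets unrelated to the current context
            score = len(shared) / len(context | nugget.components)
            scored.append((score, nugget))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [nugget for _, nugget in scored[:top_k]]

# Hypothetical context: identifiers found in the method the developer is editing.
context = frozenset({"HashMap", "Iterator", "remove"})
nuggets = [
    Nugget("HashMap permits one null key.", frozenset({"HashMap"})),
    Nugget("Remove entries through the Iterator while iterating a HashMap.",
           frozenset({"HashMap", "Iterator"})),
    Nugget("Close the Socket when you are done with it.", frozenset({"Socket"})),
]
ranked = rank_nuggets(context, nuggets)
for nugget in ranked:
    print(nugget.text)
```

Because the context is derived from the code at hand, the ranking works without a user query. A query, when present, could be added to the context set.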