What does all of that mean in English?

The graduate student I work with, Emily Hill, was conducting an experiment to test a tool that searches the natural language (like English, as opposed to a programming language, like Java) used in the code of a program. She ran into a problem: none of the subjects in the experiment could find the pieces of code responsible for creating text fields in a given program.

The reason? To create a text field in this program, the user would use the “text field tool.” Therefore, all of the subjects searched for “text field” while trying to find the relevant code. The actual piece of code, however, referred to the tool as the “TextfieldTool.” If it had been called the “TextFieldTool” with a capital F, then the natural language search tool being tested would have returned the correct pieces of code, because it would have considered the change from lowercase to a capital letter equivalent to a break between words.
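To make the mismatch concrete, here is a minimal sketch of a convention-based splitter that breaks identifiers at lowercase-to-uppercase transitions and at non-alphabetic characters. It is not the search tool from the experiment; the function names `split_camel_case` and `matches_query` are hypothetical and exist only for illustration.

```python
import re


def split_camel_case(identifier):
    """Split an identifier at lowercase-to-uppercase changes and non-letters."""
    # Insert a space wherever a lowercase letter is followed by an uppercase one.
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", identifier)
    # Break on anything that is not a letter (spaces, underscores, digits).
    return [word.lower() for word in re.split(r"[^A-Za-z]+", spaced) if word]


def matches_query(identifier, query):
    """Return True if every word of the query appears among the identifier's words."""
    words = set(split_camel_case(identifier))
    return all(term in words for term in query.lower().split())


print(split_camel_case("TextFieldTool"))             # ['text', 'field', 'tool']
print(split_camel_case("TextfieldTool"))             # ['textfield', 'tool']
print(matches_query("TextFieldTool", "text field"))  # True
print(matches_query("TextfieldTool", "text field"))  # False
```

With only the capitalization rule to go on, "Textfield" stays a single word, so a search for "text field" never reaches it.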

The solution? To create an algorithm that could split “textfield” into “text field,” or, more generally, that could split any identifier into its component words. I was tasked with this problem, and the algorithm I developed is called Samurai.

Samurai can split an identifier into its component words even when capital letters do not mark the boundaries. In an evaluation of over 8000 identifiers from open source Java programs, it outperformed the existing state-of-the-art techniques.

Automated software engineering tools (e.g., program search, concern location, code reuse, and quality assessment) increasingly rely on natural language information from comments and identifiers in code. The first step in analyzing words from identifiers requires splitting identifiers into their constituent words. Unlike natural languages, where spaces and punctuation are used to delineate words, identifiers cannot contain spaces. One common way to split identifiers is to follow programming language naming conventions. For example, Java programmers often use camel case, where words are delineated by uppercase letters or non-alphabetic characters. However, programmers also create identifiers by concatenating sequences of words together with no discernible delineation, which poses challenges to automatic identifier splitting.

In this paper, we present an algorithm to automatically split identifiers into sequences of words by mining word frequencies in source code. With these word frequencies, our identifier splitter uses a scoring technique to automatically select the most appropriate partitioning for an identifier. In an evaluation of over 8000 identifiers from open source Java programs, our Samurai approach outperforms the existing state-of-the-art techniques.
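The general idea of choosing a partitioning by word frequency can be sketched as follows. This is an illustration under stated assumptions, not the scoring function from the paper: the counts in `WORD_COUNTS` are made up, the log-probability score with a length penalty for unknown strings is a generic word-segmentation heuristic, and the names `word_score`, `best_split`, and `split_same_case` are hypothetical.

```python
import math
from functools import lru_cache

# Hypothetical counts standing in for word frequencies mined from source code.
WORD_COUNTS = {"text": 120, "field": 95, "tool": 60, "ext": 4, "fie": 1}
TOTAL = sum(WORD_COUNTS.values())


def word_score(word):
    """Log-probability of a word; unknown strings are penalized by their length."""
    count = WORD_COUNTS.get(word)
    if count is None:
        return math.log(1.0 / (TOTAL * 10 ** len(word)))
    return math.log(count / TOTAL)


@lru_cache(maxsize=None)
def best_split(token):
    """Return (score, words) for the highest-scoring partitioning of token."""
    if not token:
        return (0.0, ())
    candidates = []
    for i in range(1, len(token) + 1):
        head, tail = token[:i], token[i:]
        tail_score, tail_words = best_split(tail)
        candidates.append((word_score(head) + tail_score, (head,) + tail_words))
    return max(candidates)


def split_same_case(token):
    """Split a lowercase token into the most plausible sequence of words."""
    return list(best_split(token.lower())[1])


print(split_same_case("textfield"))  # ['text', 'field']
print(split_same_case("tool"))       # ['tool']
```

Because every candidate partitioning is scored against the same frequency table, frequent words such as "text" and "field" win over rare fragments such as "ext" or "fie", and a token that is already a common word is left whole.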

Mining Source Code to Automatically Split Identifiers for Software Analysis

