Useful Links

How the relevant sources identify code elements within mixed documents:

Bacchelli et al. 2011: Uses naming conventions and capitalization (e.g., camel casing) to identify fragments. States that a context-free grammar is used to identify structured fragments, but does not specify how, and gives no example of the entries in the CFG.

Dagenais & Robillard 2012: “Code-like term” is defined as “a series of characters that matches a pattern associated with a type of code element”, e.g., parentheses for functions, camel casing for types, anchors for XML elements. There are also “code-like term lists”, which are sequences of code-like terms, and “code snippets”, which are “small regions of source code that can be further divided into a list of code-like terms”. Code-like terms, and the lists and snippets built from them, are identified with lightweight techniques based on regular expressions.
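A minimal sketch of what "one pattern per type of code element" could look like. The three regular expressions and the names CODE_LIKE_PATTERNS and find_code_like_terms are assumptions made for illustration; the paper's actual expressions are not reproduced in these notes.

```python
import re

# Hypothetical patterns, one per kind of code element (not the authors' own).
CODE_LIKE_PATTERNS = {
    # An identifier immediately followed by a parenthesized argument list.
    "function": re.compile(r"\b[A-Za-z_]\w*\([^()]*\)"),
    # CamelCase with at least two capitalized parts, e.g. HttpClient.
    "type": re.compile(r"\b(?:[A-Z][a-z0-9]+){2,}\b"),
    # A bare opening tag such as <bean>.
    "xml_element": re.compile(r"<[A-Za-z][\w.-]*>"),
}


def find_code_like_terms(text):
    """Return (kind, term) pairs for every match, ordered by position in text."""
    hits = []
    for kind, pattern in CODE_LIKE_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), kind, match.group()))
    return [(kind, term) for _, kind, term in sorted(hits)]


if __name__ == "__main__":
    sample = "Call getName() on a HttpClient inside the <bean> element."
    for kind, term in find_code_like_terms(sample):
        print(kind, term)
```

A "code-like term list" in this sketch would simply be the returned sequence for one sentence or snippet.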

Rigby & Robillard 2013: Naming conventions, camel casing, and lightweight techniques based on regular expressions, just as in Bacchelli et al. 2010. Uses “regular expressions approximated following constructs in the Java Language Specification”: qualified terms, package names, variable declarations, qualified variables, method chains, class definitions including inheritance, declarations, method overrides, inner classes, constructors, stack traces, annotations, and exceptions. The regular expressions are ordered from most precise to most flexible.
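The ordering from most precise to most flexible can be sketched as a list that is tried top to bottom, with the first match winning. The patterns below are heavily simplified assumptions covering only a few of the listed constructs, and ORDERED_PATTERNS and classify are hypothetical names, not taken from the paper.

```python
import re

# Hypothetical, simplified patterns, listed from most precise to most flexible.
ORDERED_PATTERNS = [
    ("stack_trace", re.compile(r"^\s*at\s+[\w.$]+\([\w.]+:\d+\)$")),
    ("class_definition",
     re.compile(r"\bclass\s+[A-Z]\w*(?:\s+extends\s+[A-Z][\w.]*)?")),
    ("annotation", re.compile(r"@[A-Z]\w*")),
    ("method_chain", re.compile(r"\b\w+(?:\.\w+\([^()]*\))+")),
    ("qualified_term", re.compile(r"\b[a-z]\w*(?:\.[a-z]\w*)*\.[A-Z]\w*")),
    ("camel_case", re.compile(r"\b(?:[A-Z][a-z0-9]+){2,}\b")),
]


def classify(line):
    """Return (label, matched text) for the first pattern that matches, else None."""
    for label, pattern in ORDERED_PATTERNS:
        match = pattern.search(line)
        if match:
            return label, match.group().strip()
    return None


if __name__ == "__main__":
    for line in [
        "    at com.example.Foo.bar(Foo.java:42)",
        "public class ArrayList extends AbstractList",
        "builder.append(x).toString()",
        "see java.util.List for details",
    ]:
        print(classify(line))
```

Putting the flexible camel-case pattern last keeps it from claiming text that a more specific pattern, such as the class definition, would label more accurately.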

Bacchelli et al. 2010: References entities by their names. Uses camel casing for class names. Original source for the lightweight linking technique. Switches to strict matching by name when a code-like term turns out not to be compound. Has a regular expression that accounts for punctuation around names, because an entity name can be written as a single word separated from others by whitespace or connected to them by punctuation.
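A rough sketch of name matching that tolerates surrounding punctuation, plus a check for whether a name is compound. The helper names (name_pattern, is_compound, mentions) and the exact expressions are assumptions; the notes above do not record the paper's regular expression or what its strict matching consists of.

```python
import re


def name_pattern(entity_name):
    """Match entity_name as a whole token bounded by whitespace, punctuation, or text edges."""
    # The lookarounds reject matches embedded in a longer identifier.
    return re.compile(
        r"(?<![A-Za-z0-9_])" + re.escape(entity_name) + r"(?![A-Za-z0-9_])"
    )


def is_compound(entity_name):
    """True when the name has an internal capital letter, e.g. 'FigureCanvas'."""
    return re.search(r"[a-z0-9][A-Z]", entity_name) is not None


def mentions(entity_name, text):
    """True when text contains entity_name as a standalone token."""
    return name_pattern(entity_name).search(text) is not None


if __name__ == "__main__":
    print(mentions("FigureCanvas", "see FigureCanvas.paint() for details"))
    print(mentions("Figure", "FigureCanvas draws the figure"))
    print(is_compound("FigureCanvas"), is_compound("Figure"))
```

In this sketch a caller could use is_compound to decide when to fall back to a stricter rule for single-word names, which are more likely to collide with ordinary prose.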

Antoniol et al. 2002: Assumes “people use meaningful names for program items”, capitalizes on mnemonics for identifiers.

Xie et al. 2012: Might be useful if we could get more information as to their methods. Paper is on identifying sentences that contain the information required for code contracts and inferring method specifications from that information, but it doesn't state how the sentences are identified in the first place.