\\ **Extracting Source Code from E-Mails**
\\ Authors: Alberto Bacchelli, Marco D'Ambros, Michele Lanza

[[http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5521781|Link]]

**Problem:** Mine archived emails to support program comprehension activities and provide views of a software system that are alternative and complementary to those offered by the source code.

**\\ Importance:** Programmers who need to know the design rationale behind an implementation have to communicate with other developers, because the information stored in different artifacts (e.g., source code, design documents, bug reports, chat logs) emphasizes different aspects of the system's evolution.

**Approach:**
\\ __Classify emails containing source code__, using:
\\ 1. Number of occurrences of Java keywords and special characters.
\\ 2. End of line (the line ends with a semicolon).
\\ 3. Method-call patterns, checked with regular expressions.
\\ 4. Beginning of a block.
\\ __Extract the source code pieces inside the emails.__
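
To make the four cues above concrete for myself, here is a minimal Python sketch of a line-level classifier. It is not the authors' implementation. The keyword subset, the special-character set, the method-call regular expression, the threshold of 3, the reading of "beginning of block" as a line ending in `{`, and the way the cues are OR-ed together are all my own assumptions for illustration.

```python
import re

# Assumption: a small illustrative subset of Java keywords, not the paper's list.
JAVA_KEYWORDS = {
    "public", "private", "protected", "static", "final", "void", "class",
    "interface", "import", "package", "return", "new", "if", "else", "for",
    "while", "try", "catch", "throw", "throws", "int", "boolean", "null",
}
# Assumption: characters that are common in Java and rarer in prose.
SPECIAL_CHARS = set("{}();=")

# Hypothetical pattern: an (optionally qualified) identifier directly
# followed by a parenthesised argument list, e.g. worker.start() or foo(x, y).
METHOD_CALL = re.compile(r"\b[A-Za-z_]\w*(?:\.[A-Za-z_]\w*)*\([^()]*\)")


def keyword_score(line):
    """Cue 1: count Java keywords plus special characters on the line."""
    tokens = re.findall(r"[A-Za-z_]\w*", line)
    keywords = sum(1 for token in tokens if token in JAVA_KEYWORDS)
    specials = sum(1 for ch in line if ch in SPECIAL_CHARS)
    return keywords + specials


def ends_with_semicolon(line):
    """Cue 2: the line ends with a semicolon."""
    return line.rstrip().endswith(";")


def has_method_call(line):
    """Cue 3: the line matches the method-call pattern."""
    return METHOD_CALL.search(line) is not None


def begins_block(line):
    """Cue 4 (my reading): the line opens a block, i.e. ends with '{'."""
    return line.rstrip().endswith("{")


def looks_like_code(line, keyword_threshold=3):
    """Flag a line as code if any cue fires (the OR is an assumption)."""
    return (
        keyword_score(line) >= keyword_threshold
        or ends_with_semicolon(line)
        or has_method_call(line)
        or begins_block(line)
    )


def extract_code_lines(body, keyword_threshold=3):
    """Return the lines of an email body that are flagged as code."""
    return [l for l in body.splitlines() if looks_like_code(l, keyword_threshold)]


def email_contains_code(body, keyword_threshold=3):
    """Classify an email as containing code if at least one line is flagged."""
    return len(extract_code_lines(body, keyword_threshold)) > 0
```

On a short email that quotes the method `public void run() { worker.start(); }` spread over three lines, this sketch flags the first two lines but misses the lone closing `}`, which shows how such lightweight cues can lose recall on the extraction step even when the classification step succeeds.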

**Previous work:**
Bettenburg et al. used an island parser to extract code snippets from bug reports, with almost perfect results (P = 0.98, R = 0.99).
However, using a parser to extract source code from emails has drawbacks:
\\ 1. High computational effort.
\\ 2. Scaling up to archives might be difficult.
\\ 3. Mailing lists, being natural-language documents, are more prone to noise.

Hence, the authors devised lightweight, easy-to-implement approaches.

**\\ Benchmark:**
Five open-source Java projects with different development paradigms. The authors randomly picked emails from each project and manually determined which ones contain code; a table reports the resulting percentages.

**\\ Work done:**
The authors developed a custom web application, 'Miler', which shows:
\\ a) Systems: the list of software systems loaded for analysis
\\ b) Mails: the number of emails that have been read
\\ c) Retrieval of any email by its id
\\ d) The email header and body
\\ e) Annotated code fragments


**\\ Evaluation:**
Precision (P): the fraction of retrieved lines that actually contain code.
\\ Recall (R): the fraction of the lines that actually contain code that are retrieved.
\\ Instead of emphasizing either P or R, they use a β value to weight precision against recall (the F-measure), which I find really logical.
\\ To assess the effectiveness of the extraction, they also use the Levenshtein distance (an edit-distance function), which outputs the minimum number of changes (in lines) between the text labeled as source code in the benchmark and the extracted fragments.
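
The measures above can be sketched in a few lines of Python. This is only my illustration, not the authors' evaluation code. I assume that lines are identified by their line numbers, that the weighted score is the standard F-measure, F_β = (1 + β²)·P·R / (β²·P + R), where β > 1 favours recall and β < 1 favours precision, and that the edit distance treats a whole line as one unit. The function names are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Line-level precision and recall over sets of line numbers."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def f_beta(precision, recall, beta=1.0):
    """Standard F-measure; beta > 1 weights recall more, beta < 1 precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def line_edit_distance(expected, extracted):
    """Levenshtein distance where one edit inserts, deletes, or replaces a line."""
    prev = list(range(len(extracted) + 1))
    for i, exp_line in enumerate(expected, start=1):
        curr = [i]
        for j, ext_line in enumerate(extracted, start=1):
            cost = 0 if exp_line == ext_line else 1
            curr.append(min(prev[j] + 1,          # delete a benchmark line
                            curr[j - 1] + 1,      # insert an extracted line
                            prev[j - 1] + cost))  # keep or replace
        prev = curr
    return prev[-1]


# Benchmark says lines 3, 4 and 5 are code; the extractor returned lines 3 and 4.
p, r = precision_recall(retrieved={3, 4}, relevant={3, 4, 5})
scores = {beta: f_beta(p, r, beta) for beta in (0.5, 1.0, 2.0)}
```

In this toy case precision is perfect and recall is 2/3, so the score drops as β grows and recall gets more weight. Comparing a three-line benchmark fragment with an extraction that misses its last line gives a line-level distance of 1.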

**Critique:**
\\ 1. Bad - They rely only on the number of occurrences of Java keywords or Java special characters, which restricts the approach to emails that discuss the development of Java projects. It also means they must already have a database of Java keywords and special characters.
\\ 2. Good - As P and R trade off against each other, they devise an approach in which, by varying the threshold, they can obtain either perfect P or perfect R.
\\ 3. Question - How do they provide alternative views of the software system? They only extract source code pieces from developer emails.