CodeMatch Algorithms
The Algorithms
CodeMatch
uses several algorithms to determine similarity between two
source code files. These algorithms are described below. When
multiple files are compared, file pairs are given a correlation
score between 0 and 100. The file pairs are then ranked by CodeMatch
score so that you can examine the most similar files.
Statement Matching
The CodeMatch
Statement Matching algorithm compares each functional line
of source code from both files, excluding comment lines. Also,
all whitespace removed. For source lines to be considered
matches, they must contain at least one non-keyword such as
a variable name or function name. Otherwise, lines containing
basic operations would be reported as matching. This algorithm
yields a correlation score representing the number of matching
instruction lines in the two files.
Comment/String
Matching
The CodeMatch
Comment/String Matching algorithm compares each line
of comments and each string from both files. Sequences
of whitespace are converted to single spaces so that
the syntax structure of the line is preserved. The entire
comment or string is compared, regardless of whether
there are keywords in the comment or string. This algorithm
yields a correlation score representing the number of
matching comments and strings in the two files.
Instruction Sequence
Matching
The CodeMatch
Instruction Sequence Matching algorithm compares the first
instruction of every source line in the pair of files, ignoring
blank lines and comment lines. This algorithm finds sequences
of code that appear to perform the same functions despite
changed comments and identifier names. The algorithm finds
the longest common sequence within both files. This algorithm
yields a correlation score representing the number of source
lines in the longest matching instruction sequence in the
two files. Even if variable names are altered in one file,
CodeMatch will report similarities in the files. The following
shows an example of two identical instruction sequences in
C:
//
File 1
if (x == 5)
{
// Loop on j here
for (j = 0; j < Index; j++)
printf("x
= %i", j);
}
else
break; //
Here's the break
//
File 2
if (xyz < 2)
for (jjj = 0; jjj < i; jjj++)
{
printf("Hello
world\n");
}
else
break;
Identifier Matching
For each
file pair, the CodeMatch Identifier Matching algorithm counts
the number of matching identifiers that are not programming
language keywords. In order to determine whether an identifier
is a programming language keyword, comparison is done with
a list of programming language keywords. Only non-keywords
are compared in order to find matching function names, variable
names, and other identifiers. This algorithm also examines
each non-keyword identifier in the source code of one file
and finds all identifiers that match a sequence within one
or more non-keyword identifiers in the other file. For example,
if identifier abc is found in one file and the identifiers
aabc, abc1111111, and abcxxxyz are found in anther file, then
there is a partial match for the identifier abc. Partial matching
finds common function names, variable names, and other identifiers
where someone has attempted to disguise the identifiers in
a fairly basic way. This algorithm yields a correlation score
representing the number of matching and partially matching
identifiers in the two files.
Correlation Score
Finally,
a single correlation score is given for the similarity of
the file pairs. If a file pair has a higher score, it implies
that these files are more similar and may be plagiarized from
each other or from a common third file. There are many kinds
of plagiarism and many ways of fooling plagiarism detection
programs. For this reason, CodeMatch produces an HTML output
report with a list of file pairs ordered by their total match
scores. The user can click on a file pair hyperlink to bring
up a detailed HTML report showing exact matches that occurred
between the selected file pairs. In this way, experts are
directed to suspicious similarities and allowed to make their
own judgments. CodeMatch is not a tool for precisely pinpointing
plagiarized code, but rather a tool to assist an expert in
finding plagiarized code. CodeMatch reduces the effort needed
by the expert by allowing him to narrow his focus from hundreds
of thousands of lines of code in hundreds of files to dozens
of lines of code in dozens of files.
|