In 2003 I created the CodeMatch program, which very quickly became a de facto standard in software IP litigation. I created a test bench of purposely plagiarized code that could be used to independently and objectively compare the results produced by different plagiarism detection programs. Some in the academic community claimed that my tests were biased toward the algorithms used by CodeMatch, which, they argued, explained why CodeMatch fared so well compared to the other programs. However, despite my requests, these same critics never produced a standard test set of their own.
Although I believe that the standard tests I have used are not biased, it occurred to me that there could be a better way to eliminate even unintentional bias. The solution would be to take the source code for certain open source programs and announce a new open source project that would involve purposely plagiarizing the code. Programmers from around the world would be invited, perhaps in a competition, to change the source code while retaining the functionality. The original programs and the plagiarized versions submitted by others would be stored in a database known as the Depository of Universal Plagiarism Examples, or DUPE. Plagiarism detection programs would then be run on DUPE, and the results could be compared to determine which programs best detected copying. Important statistics about plagiarized code could also be gathered, and patterns identified, to improve the plagiarism detection programs.
SAFE Corporation has begun looking into creating this database. However, we realize that we would like to work with partners in academia and industry. We believe that there are several key issues that need to be resolved in creating DUPE. These are:
- Choosing appropriate open source projects.
- Creating a minimum definition of software plagiarism.
- Creating the database.
- Determining policies, including who can access it, how it will be used, and who will maintain it.
- Determining how to run the tests, how to generate the results, and how to distribute the results.
Please contact me if you’re interested in working on this important and groundbreaking project.
I recently came across a study in the Journal of the American Academy of Psychiatry and Law out of The University of Alabama entitled “Credibility in the Courtroom: How Likeable Should an Expert Witness Be?” To be honest, I’m not sure I understand their conclusion:
The likeability of the expert witnesses was found to be significantly related to the jurors’ perception of their trustworthiness, but not to their displays of confidence or knowledge or to the mock jurors’ sentencing decisions.
Reading the paper doesn’t make it a whole lot clearer for me, and I think their mock trial setup is a bit contrived, particularly since the jury consisted of psychology students, a demographic you’d be unlikely to find on a real jury. Also, there were only two expert witnesses for the comparison. To their credit, they discuss these potential shortcomings. I do think, however, that the paper points out something (that may have already been obvious): there is more to being an expert witness than just being correct. Personality and presentation are strong factors.
On the other hand, I feel that this subjective aspect should be minimized. Experts need standards and measurable quantities whenever possible. Before I began developing the concept of source code correlation, the way software copyright infringement and trade secret theft cases were resolved was to have two experts give contrary opinions based on their years of experience. The judge or jury would tend to get lost in the technical details, a strategy purposely employed by some experts and attorneys, and a judgment would depend on which expert appeared more credible.
Instead, I decided to expand the field of software forensics and made it my goal to bring as much credibility to the field as DNA analysis, another very complex process that is well accepted in modern courts. I still believe that an expert’s credibility and likeability will always be factors in IP litigation, but the emergence of source code correlation and object code correlation provides standard measures that bring a great deal of objectivity to a lawsuit’s outcome.
There are a lot of unanswered questions about source code, and we want to work with you to figure them out. We realize that currently accepted algorithms for analyzing, comparing, and measuring source code leave a lot to be desired in many cases. Also, there are many techniques that have never been studied on large bodies of modern code. For example, measurement techniques developed in the 1970s were probably tested on assembly languages and older programming languages like BASIC, FORTRAN, and COBOL. Do they still hold for modern object-oriented languages like Java and C#?
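To make the question concrete, here is a rough sketch, using Python’s standard tokenizer, of one such 1970s-era measure: Halstead volume. Real Halstead counting classifies operators and operands more carefully than this; the sketch is only meant to illustrate the kind of metric whose validity on modern languages is open to study.

```python
# Hypothetical sketch of Halstead volume, a 1970s software metric.
# Volume = N * log2(n), where N is the total count of operators and
# operands and n is the number of distinct ones. This crude version
# treats every OP token as an operator and every name/number/string
# as an operand, which real Halstead counting does not.
import io
import math
import tokenize

def halstead_volume(source: str) -> float:
    operators, operands = [], []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.OP:
            operators.append(tok.string)
        elif tok.type in (tokenize.NAME, tokenize.NUMBER, tokenize.STRING):
            operands.append(tok.string)
    n = len(set(operators)) + len(set(operands))   # distinct vocabulary
    N = len(operators) + len(operands)             # total length
    return N * math.log2(n) if n else 0.0

print(round(halstead_volume("total = price * qty + tax\n"), 1))
```

Whether numbers like this, designed for line-oriented procedural code, say anything meaningful about a large Java or C# class hierarchy is exactly the sort of open research question we have in mind.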
If you have a research idea relating to code analysis, and you can use the SAFE tools, let us know. Email Larry Melling, VP of Sales and Marketing, with your ideas. If they pass our review process you’ll get free licenses to our tools, free support, and help getting your results published. This could be the beginning of a beautiful friendship.
Copyrights protect expressions of ideas, but not the ideas themselves. Anyone can write about two young lovers from different families and different backgrounds and not fear getting sued by the estates of William Shakespeare or Arthur Laurents, or anyone who writes daytime TV movies. It is for this reason that software can be reverse engineered to learn the ideas it embodies without violating the copyright, as long as the code is not copied and used commercially. The first lawsuit verdict that enforced this idea was Atari Games v. Nintendo in September 1992.
Nintendo tightly controlled access to its successful NES video game system and did not release the specifications for creating a game cartridge for the system. In order to produce a game for the system, companies had to pay a license fee to Nintendo and had to agree not to produce the licensed game for any other game system for two years. Incorporated into the Nintendo NES system was a computer program called 10NES that checked whether a particular game had been licensed. If not, the game was not allowed to run. Atari reverse engineered the 10NES program and created its own program, called Rabbit, for bypassing 10NES. Atari sued Nintendo for, among other things, unfair competition and monopolistic practices. Nintendo countersued for, among other things, copyright infringement. The U.S. Court of Appeals ruled that the reverse engineering was perfectly legal. It also ruled that Atari infringed on Nintendo’s copyright when Atari created its own program based on Nintendo’s program. The decision by Judge Randall Rader reads as follows:
The district court assumed that reverse engineering (intermediate copying) was copyright infringement… This court disagrees. Atari did not violate Nintendo’s copyright by deprocessing computer chips in Atari’s rightful possession. Atari could lawfully deprocess Nintendo’s 10NES chips to learn their unprotected ideas and processes. This fair use did not give Atari more than the right to understand the 10NES program and to distinguish the protected from the unprotected elements of the 10NES program. Any copying beyond that necessary to understand the 10NES program was infringement. Atari could not use reverse engineering as an excuse to exploit commercially or otherwise misappropriate protected expression.
You have the source code from two different programs. You run them through CodeMatch and find high correlation numbers. Have you proven copying? Not yet. There are still a few steps to go through first. Finding a correlation between the source code files for two different programs doesn’t necessarily mean that illicit behavior occurred. At SAFE we’ve determined that there are exactly six reasons for correlation between two different programs. These reasons can be summarized as follows.
- Third-Party Source Code. Both programs use open source code or purchased libraries.
- Code Generation Tools. Automatic code generation tools, such as Microsoft Visual Basic or Adobe Dreamweaver, generate software source code that looks very similar.
- Common Identifier Names. Certain identifier names are commonly taught in schools or commonly used by programmers in a particular field.
- Common Algorithms. There may be an easy or well-understood way of writing a particular algorithm that most programmers use, or one that was taught in school or in textbooks.
- Common Author. One programmer, or “author,” will create two programs that have correlation simply because that programmer tends to write code in a certain way.
- Copied Code. Code was copied from one program to another. If the copying was not authorized by the original owner, then it constitutes plagiarism.
It’s important to understand these reasons when using CodeMatch, especially in litigation. Before there can be proof of copyright infringement, the other five reasons for correlation must be eliminated. CodeSuite offers sophisticated filtering functions that allow you to filter out aspects of the code that are correlated due to those five reasons. What’s left, after filtering, is correlation due to copying.
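The filtering step can be pictured with a small, purely hypothetical sketch (this is not CodeSuite’s actual implementation): imagine each correlated element has been labeled with its most likely explanation, and everything attributable to the first five reasons is removed.

```python
# Hypothetical illustration of the filtering principle described above;
# the labels and data structures are invented for this example.

# The five innocent explanations for correlation.
INNOCENT_REASONS = {
    "third_party",        # open source or purchased libraries
    "code_generation",    # output of tools like Visual Basic or Dreamweaver
    "common_identifier",  # names commonly taught or widely used
    "common_algorithm",   # standard textbook implementations
    "common_author",      # one programmer's habitual style
}

def filter_correlation(matches):
    """Remove matches explained by an innocent reason; what remains
    is correlation attributable to copying."""
    return [m for m in matches if m["reason"] not in INNOCENT_REASONS]

matches = [
    {"element": "string_utils",  "reason": "third_party"},
    {"element": "i_j_k_loops",   "reason": "common_identifier"},
    {"element": "odd_checksum",  "reason": "unexplained"},
]

remaining = filter_correlation(matches)
print([m["element"] for m in remaining])  # ['odd_checksum']
```

In practice, of course, assigning each correlated element to a reason is the hard part, and that is what the real filtering functions and the expert’s analysis are for.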
You can read more about this in the article in IP Today entitled “What, Exactly, Is Software Plagiarism?”
SAFE has just released a new tool for comparing computer code to detect copyright infringement and trade secret theft. CodeCross™ finds traces of nonfunctional source code that have been copied from one program to another.
According to Nikolaus Baer at Zeidman Consulting, a SAFE Corporation customer, “I suggested the concept of CodeCross after working on cases where stolen code had nonfunctional remnants in another party’s code. SAFE developed the tool quickly and it works great. In one case I found traces of copied code that had previously gone undetected.”
CodeCross is available with CodeSuite 3.2.0 and can be downloaded for free from the SAFE Corporation website.
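The core idea of comparing nonfunctional code can be illustrated with a toy example (not CodeCross’s actual algorithm): extract the comments from two C-style source files and report any that appear verbatim in both, since identical comments are one telltale remnant that copied code can leave behind.

```python
# Toy illustration of comparing nonfunctional code: find comments that
# appear verbatim in two C-style source files. This is an invented
# example, not how CodeCross actually works.
import re

# Matches // line comments and /* ... */ block comments.
COMMENT_RE = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)

def extract_comments(source: str) -> set:
    return {c.strip() for c in COMMENT_RE.findall(source)}

def shared_comments(src_a: str, src_b: str) -> set:
    return extract_comments(src_a) & extract_comments(src_b)

a = "int x;  // XXX fix rounding bug\nint y;  /* temp hack */\n"
b = "float q;  // XXX fix rounding bug\n"
print(shared_comments(a, b))  # {'// XXX fix rounding bug'}
```

A distinctive comment surviving unchanged in otherwise rewritten code is exactly the kind of nonfunctional remnant that can point an examiner toward copying.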
Welcome to my blog. I’m the founder and president of SAFE Corporation. I’ll be regularly updating this blog with useful and interesting information about software intellectual property and software analysis. I’ll be posting facts and current events relating to IP litigation, software plagiarism detection (a key element in copyright infringement cases), trade secret theft, and patent infringement. At SAFE Corporation we’re developing unique, groundbreaking tools for analyzing, measuring, and comparing software, and of course I’ll be updating you on all the cool new tools.
So stay tuned and feel free to comment.