Tag Archives: software plagiarism

Multiprocessing CodeSuite-MP

Until now there were two ways of running really big CodeSuite jobs. One was to simply run the job and wait for as long as it took; really large jobs can take as much as a week or two. The other option was to run the job on CodeGrid, our framework that distributes the job over a grid of networked computers. CodeGrid shows an almost linear speedup for each computer on the grid, but it requires someone to maintain the computers and the network, and that can be a daunting job. Now there’s a third option: CodeSuite-MP allows you to run multiple jobs on a single multicore computer. We’re seeing a near-linear speedup with the number of cores, and there’s no special maintenance required. We’re even seeing a near-linear speedup using virtual cores. If you want to get a license for CodeSuite-MP, contact our sales department.
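To give a feel for why this kind of work scales with the number of cores, here is a minimal sketch in Python. It is not how CodeSuite-MP is implemented. It assumes a job can be split into independent file-to-file comparisons, and it uses the standard library's difflib as a stand-in for a real comparison algorithm. The names compare_pair and compare_all are hypothetical.

```python
import difflib
from itertools import product
from multiprocessing import Pool


def compare_pair(pair):
    """Hypothetical stand-in for one comparison job: score two source texts."""
    (name_a, text_a), (name_b, text_b) = pair
    # difflib is only a placeholder for a real correlation algorithm.
    score = difflib.SequenceMatcher(None, text_a, text_b).ratio()
    return name_a, name_b, score


def compare_all(files_a, files_b, workers=None):
    """Compare every file of one program against every file of the other.

    Each pair is independent of the others, so the pairs can be handed to
    separate worker processes (one per core when workers is None).
    """
    pairs = list(product(files_a.items(), files_b.items()))
    with Pool(processes=workers) as pool:
        return pool.map(compare_pair, pairs)


if __name__ == "__main__":
    program_a = {"a1.c": "int x = 1;\n", "a2.c": "return x + 1;\n"}
    program_b = {"b1.c": "int x = 1;\n", "b2.c": "float y = 2.0;\n"}
    for name_a, name_b, score in compare_all(program_a, program_b):
        print(f"{name_a} vs {name_b}: {score:.2f}")
```

Because no pair depends on the result of any other pair, the workers never have to wait on each other, which is the property that makes a near-linear speedup plausible.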

Can whitespace patterns provide clues to plagiarism?

Over the years I’ve run into expert witnesses and attorneys who have told me about software copyright infringement cases where the only clues that copying occurred were patterns of spaces and tabs (“whitespace”). The idea is that if a truly ambitious thief wanted to cover his tracks, he would modify the stolen code so much that there was no longer a visible trace of copying. However, the clever software sleuth could find patterns of whitespace that the thief had missed; although virtually nothing remained, the invisible tabs and spaces could produce a conviction.

This always sounded intriguing, but I wondered whether anyone had ever tested this theory. We could find no articles or papers on the subject, except for one inconclusive paper, and I dreaded to think that some programmer was convicted based on an untested theory. I decided to have my consulting company, Zeidman Consulting, do some carefully controlled research. If the results turned out well, SAFE Corporation would add whitespace pattern algorithms to CodeSuite to further enhance its ability to detect copying.

Our results were published in a paper entitled Measuring Whitespace Patterns as an Indication of Plagiarism that was recently presented at the ADFSL Conference on Digital Forensics, Security and Law. Our results are summarized in the final paragraph:

This whitespace pattern matching method can be used to focus a search for evidence of similarity or copying, but this method cannot stand by itself.

What we discovered is that even very different files often have similar whitespace patterns. At Zeidman Consulting we’ve used whitespace patterns to confirm copying that was already detected through the use of CodeMatch to find correlated programming elements. In those cases, the whitespace patterns offered further confidence in our findings and in some cases showed which program had been developed first. For a copy of the paper, email us at info@SAFE-corp.biz.
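To illustrate how unrelated code can share whitespace patterns, here is a minimal sketch in Python. It is not the method used in the paper, which this post does not describe. It assumes that a "whitespace pattern" is simply the sequence of space and tab runs on a line, and the function names are hypothetical.

```python
import re
from collections import Counter


def whitespace_patterns(source):
    """Count the whitespace-run sequences found on each non-blank line."""
    patterns = Counter()
    for line in source.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        runs = tuple(re.findall(r"[ \t]+", line))
        if runs:
            patterns[runs] += 1
    return patterns


def pattern_overlap(source_a, source_b):
    """Fraction of whitespace patterns shared by two files (0.0 to 1.0)."""
    a = whitespace_patterns(source_a)
    b = whitespace_patterns(source_b)
    shared = sum((a & b).values())  # multiset intersection
    total = max(sum(a.values()), sum(b.values()))
    return shared / total if total else 0.0


if __name__ == "__main__":
    # Two unrelated snippets that follow the same formatting conventions.
    file_a = "if (x) {\n    y = 1;\n}\n"
    file_b = "while (n) {\n    k = 2;\n}\n"
    print(pattern_overlap(file_a, file_b))
```

The two snippets share no statements, yet every whitespace pattern in one also appears in the other, simply because both use four-space indentation and spaces around operators. A measure like this can point a search in the right direction, but it cannot establish copying by itself.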

Our next research project is to look at sequences of whitespace within files. Maybe there we’ll find some clues to copying. But for now our results show that whitespace patterns without any other evidence should not be used to determine that copying occurred.

DUPE: Depository of Universal Plagiarism Examples

In 2003 I created the CodeMatch program that very quickly became a de facto standard in software IP litigation. I created a test bench of purposely plagiarized code that could be used to independently and objectively compare the results produced by different plagiarism detection programs. Some in the academic community claimed that my tests were biased toward the algorithms used by CodeMatch, which explained why CodeMatch fared so well compared to the other programs. However, these same critics, despite my requests, never produced their own set of standard tests.

Although I believe that the standard tests I have used are not biased, it occurred to me that there could be a better way to eliminate even unintentional bias. The solution would be to take the source code for certain open source programs and announce a new open source project that would involve purposely plagiarizing the code. Programmers from around the world would be invited, perhaps in a competition, to change the source code while retaining the functionality. The original programs and the plagiarized versions submitted from others would be stored in a database known as the Depository of Universal Plagiarism Examples or DUPE. Plagiarism detection programs would then be run on DUPE and comparisons of the results could be made to determine which programs best detected copying. Also, important statistics about plagiarized code could be determined, as well as patterns identified in order to improve the plagiarism detection programs.
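As a rough picture of how detection programs could be compared against such a database, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the record layout (original, candidate, known-copy flag), the 0.5 threshold, and the toy token_jaccard detector, which stands in for a real plagiarism detection program.

```python
def token_jaccard(text_a, text_b):
    """Toy detector: overlap of whitespace-separated tokens (0.0 to 1.0)."""
    tokens_a, tokens_b = set(text_a.split()), set(text_b.split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def evaluate(detector, labeled_pairs, threshold=0.5):
    """Return (detection_rate, false_alarm_rate) for one detector.

    labeled_pairs holds (original, candidate, is_copy) tuples, where
    is_copy records whether the candidate is a known plagiarized version.
    """
    hits = misses = false_alarms = correct_rejections = 0
    for original, candidate, is_copy in labeled_pairs:
        flagged = detector(original, candidate) >= threshold
        if is_copy and flagged:
            hits += 1
        elif is_copy:
            misses += 1
        elif flagged:
            false_alarms += 1
        else:
            correct_rejections += 1
    copies = hits + misses
    non_copies = false_alarms + correct_rejections
    detection_rate = hits / copies if copies else 0.0
    false_alarm_rate = false_alarms / non_copies if non_copies else 0.0
    return detection_rate, false_alarm_rate


if __name__ == "__main__":
    pairs = [
        ("total = price * qty", "sum = price * qty", True),
        ("total = price * qty", "print ( 'hello' )", False),
    ]
    print(evaluate(token_jaccard, pairs))
```

Running every detector over the same labeled pairs, with the labels fixed in advance by people other than the tool vendors, is what would make the comparison independent of any one program's algorithms.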

SAFE Corporation has begun looking into creating this database. However, we realize that we would like to work with partners in academia and industry. We believe that there are several key issues that need to be resolved in creating DUPE. These are:

  1. Choosing appropriate open source projects.
  2. Creating a minimum definition of software plagiarism.
  3. Creating the database.
  4. Determining policies including who can access it, how it will be used, and who will maintain it.
  5. Determining how to run the tests, how to generate the results, and how to distribute the results.

Please contact me if you’re interested in working on this important and groundbreaking project.

Interesting software IP cases of 2009

Here is my list of the most interesting software IP cases of 2009, in chronological order:

What to look for in an expert?

I recently came across a study in the Journal of the American Academy of Psychiatry and Law out of the University of Alabama entitled “Credibility in the Courtroom: How Likeable Should an Expert Witness Be?” To be honest, I’m not sure I understand their conclusion:

The likeability of the expert witnesses was found to be significantly related to the jurors’ perception of their trustworthiness, but not to their displays of confidence or knowledge or to the mock jurors’ sentencing decisions.

Reading the paper doesn’t make it a whole lot clearer for me, and I think their mock trial setup is a bit contrived, particularly since the jury consisted of psychology students, a demographic that you’d be unlikely to find on a real jury. Also there were only two expert witnesses for the comparison. To their credit, they discuss these potential shortcomings. I do think, however, that the paper points out something (that may have already been obvious)—there is more to being an expert witness than just being correct. Personality and presentation are strong factors.

On the other hand, I feel that this subjective aspect should be minimized. Experts need standards and measurable quantities whenever possible. Before I began developing the concept of source code correlation, the way software copyright infringement and trade secret theft cases were resolved was to have two experts give contrary opinions based on their years of experience. The judge or jury would tend to get lost in the technical details, a strategy purposely employed by some experts and attorneys, and a judgment would depend on which expert appeared more credible.

Instead, I decided to expand the field of software forensics and made it my goal to bring as much credibility to the field as DNA analysis, another very complex process that is well accepted in modern courts. I still believe that an expert’s credibility and likeability will always be factors in IP litigation, but the emergence of source code correlation and object code correlation provides standard measures that bring a great deal of objectivity to a lawsuit’s outcome.

SAFE Corporation is looking for great ideas

There are a lot of unanswered questions about source code, and we want to work with you to figure them out. We realize that currently accepted algorithms for analyzing, comparing, and measuring source code leave a lot to be desired in many cases. Also, there are a lot of techniques that have never been studied on large bodies of modern code. For example, measurement techniques developed in the 1970s were probably tested on assembly languages and older programming languages like BASIC, FORTRAN, and COBOL. Do they still hold for modern object-oriented languages like Java and C#?

If you have a research idea relating to code analysis, and you can use the SAFE tools, let us know. Email Larry Melling, VP of Sales and Marketing, with your ideas. If they pass our review process, you’ll get free licenses to our tools, free support, and help getting your results published. This could be the beginning of a beautiful friendship.

Software trade secrets

The precise language that legally defines a trade secret varies by jurisdiction, as do the particular types of information that are subject to trade secret protection. In the United States, different states have different trade secret laws. Most states have adopted the Uniform Trade Secrets Act, and those that haven’t have laws that differ from it only subtly.

There are three factors that are common to all definitions; a trade secret always has these three specific characteristics:

  1. It is not generally known to the public.
  2. It confers some sort of economic benefit on its holder, where the benefit is due to the fact that it is not known to the public.
  3. The owner of the trade secret makes reasonable efforts to maintain its secrecy.

With regard to software trade secrets, algorithms that are known to the public usually cannot be trade secrets, though some jurisdictions require not only that the information be public but that it be “readily ascertainable,” meaning easy to find. For example, a sorting algorithm found in a well-known textbook or in an application note on a high-traffic website is, or can be, known to the public and easily ascertained.

There must be an economic benefit, so a sorting algorithm that can be easily replaced with a well-known sorting algorithm with comparable results is not a trade secret. Similarly, if your company develops a program, perhaps as a side project, but does not sell it or incorporate it in any products, then it’s not a trade secret.

If the owner of the source code allows programmers to share code, or does not put notices of confidentiality in the source code, or does not take reasonable steps to ensure that employees do not take the code home with them, then that source code cannot be a trade secret. This third point is a particularly important reason to take precautions to ensure your software does not go somewhere it shouldn’t. Make sure your employees, investors, and partners sign nondisclosure agreements (NDAs). Make sure you have written policies about how to handle source code. And make sure you treat all individuals and companies equally. You don’t want to be in court, defending a trade secret, and have to explain why one “trusted employee” or “trusted friend” was allowed to take home source code while others were not. That doesn’t look like “reasonable efforts to maintain secrecy.”

When is reverse engineering OK?

Copyrights protect expressions of ideas, but not the ideas themselves. Anyone can write about two young lovers from different families and different backgrounds and not fear getting sued by the estates of William Shakespeare or Arthur Laurents or anyone who writes daytime TV movies. It is for this reason that software can be reverse engineered to learn the ideas it embodies without violating the copyright, as long as the code is not copied and used commercially. The first lawsuit verdict that enforced this idea was Atari Games v. Nintendo in September 1992.

Nintendo tightly controlled access to its successful NES video game system and did not release the specifications for creating a game cartridge for the system. In order to produce a game for the system, companies had to pay a license fee to Nintendo and had to agree not to produce the licensed game for any other game system for two years. Incorporated into the Nintendo NES system was a computer program called 10NES that checked whether a particular game had been licensed. If not, the game was not allowed to run. Atari reverse engineered the 10NES program and created its own program called Rabbit for bypassing 10NES. Atari sued Nintendo for, among other things, unfair competition and monopolistic practices. Nintendo countersued for, among other things, copyright infringement. The U.S. Court of Appeals for the Federal Circuit ruled that the reverse engineering was perfectly legal. It also ruled that Atari infringed on Nintendo’s copyright when Atari created its own program based on Nintendo’s program. The decision by Judge Randall Rader reads as follows:

The district court assumed that reverse engineering (intermediate copying) was copyright infringement… This court disagrees. Atari did not violate Nintendo’s copyright by deprocessing computer chips in Atari’s rightful possession. Atari could lawfully deprocess Nintendo’s 10NES chips to learn their unprotected ideas and processes. This fair use did not give Atari more than the right to understand the 10NES program and to distinguish the protected from the unprotected elements of the 10NES program. Any copying beyond that necessary to understand the 10NES program was infringement. Atari could not use reverse engineering as an excuse to exploit commercially or otherwise misappropriate protected expression.

Key points about software copyrights

First, a copyright exists at the moment of creation. In other words, a work does not need to be published to have a copyright. The copyright does not need to be registered with the U.S. Copyright Office. It is simply a right given to the person who created the work. The advantage of registering a copyright with the government is that you then have an official document proving your ownership, making it easier to win in court against someone who attempts to use your creation without your permission. Registration can be done any time after the work is created, but is required in order to initiate litigation. Winning a copyright infringement case in court, when the copyright is registered before the infringement took place or within 3 months of the publication of the work, can entitle you to get back your attorney fees as well as “statutory damages,” which essentially constitute financial punishment that is not based on the amount of money lost by the author due to the infringement. This is done to encourage people to register their copyrights and to deter people from infringing them.

As the owner of a copyright, you have the right to reproduce the work, enhance the work, distribute the work, and perform it or display it in public.

With software, the copyright gives protection to the source code and the binary code generated from the source code. In order to register a copyright, it is normally necessary to file a copy of the intellectual property being protected with the U.S. Copyright Office as proof. Since most software contains valuable trade secrets (which we discuss in a later section) that would lose their value if presented to the public, the Copyright Office allows software source code to be submitted with major sections “redacted” or left out. In fact, only the first 25 and last 25 printed pages of source code need to be submitted, though there are no guidelines as to what constitutes “first” and “last” in something consisting of many independent files and a complex interconnection of routines.

Note that a copyright notice is not required in the code, except for registering the copyright.

From correlation to copying

You have the source code from two different programs. You run them through CodeMatch and find high correlation numbers. Have you proven copying? Not yet. There are still a few steps to go through first. Finding a correlation between the source code files for two different programs doesn’t necessarily mean that illicit behavior occurred. At SAFE we’ve determined that there are exactly six reasons for correlation between two different programs. These reasons can be summarized as follows.

  • Third-Party Source Code. Both programs use open source
    code or purchased libraries.
  • Code Generation Tools. Automatic code generation tools,
    such as Microsoft Visual Basic or Adobe Dreamweaver, generate
    software source code that looks very similar.
  • Common Identifier Names. Certain identifier names are
    commonly taught in schools or commonly used by programmers in
    certain industries.
  • Common Algorithms. There may be an easy or well-understood
    way of writing a particular algorithm that most programmers use,
    or one that was taught in school or in textbooks.
  • Common Author. One programmer, or “author,”
    will create two programs that have correlation simply because
    that programmer tends to write code in a certain way.
  • Copied Code. Code was copied from one program to another.
    If the copying was not authorized by the original owner, then
    it constitutes plagiarism.

It’s important when using CodeMatch to understand these reasons, especially in litigation. Before there can be proof of copyright infringement, the other five reasons for correlation need to be eliminated. CodeSuite offers some sophisticated filtering functions that allow you to filter out aspects of the code that are correlated due to the other five reasons. What’s left, after filtering, is correlation due to copying.
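The idea of eliminating the innocent explanations first can be sketched in a few lines of Python. This is not CodeSuite's filtering interface. The Match record, the filter_matches function, and the example file and identifier names are all hypothetical, and the sketch handles only two of the reasons listed above (third-party source code and common identifier names).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """One correlated element reported between two source files."""
    element: str  # the identifier or statement that correlated
    file_a: str
    file_b: str


def filter_matches(matches, third_party_files, common_elements):
    """Drop matches explained by third-party code or common identifiers."""
    remaining = []
    for match in matches:
        if match.file_a in third_party_files or match.file_b in third_party_files:
            continue  # both programs legitimately use the same library
        if match.element in common_elements:
            continue  # names like "i" or "count" correlate everywhere
        remaining.append(match)
    return remaining


if __name__ == "__main__":
    matches = [
        Match("inflate_fast", "zlib/inffast.c", "vendor/inffast.c"),
        Match("count", "billing.c", "invoice.c"),
        Match("xq_rebalance_ledger", "billing.c", "invoice.c"),
    ]
    left = filter_matches(
        matches,
        third_party_files={"zlib/inffast.c", "vendor/inffast.c"},
        common_elements={"i", "count", "index"},
    )
    for match in left:
        print(match.element, match.file_a, match.file_b)
```

Only the unusual identifier survives the filter. Even then, the remaining reasons (code generation tools, common algorithms, and a common author) still have to be ruled out before the leftover correlation can be attributed to copying.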

You can read more about this in the article in IP Today entitled, What, Exactly, Is Software Plagiarism?