Tag Archives: source code

DUPE: Depository of Universal Plagiarism Examples

In 2003 I created the CodeMatch program that very quickly became a de facto standard in software IP litigation. I created a test bench of purposely plagiarized code that could be used to independently and objectively compare the results produced by different plagiarism detection programs. Some in the academic community claimed that my tests were biased toward the algorithms used by CodeMatch, which explained why CodeMatch fared so well compared to the other programs. However, these same critics, despite my requests, never produced their own set of standard tests.

Although I believe that the standard tests I have used are not biased, it occurred to me that there could be a better way to eliminate even unintentional bias. The solution would be to take the source code for certain open source programs and announce a new open source project that would involve purposely plagiarizing the code. Programmers from around the world would be invited, perhaps in a competition, to change the source code while retaining the functionality. The original programs and the plagiarized versions submitted from others would be stored in a database known as the Depository of Universal Plagiarism Examples or DUPE. Plagiarism detection programs would then be run on DUPE and comparisons of the results could be made to determine which programs best detected copying. Also, important statistics about plagiarized code could be determined, as well as patterns identified in order to improve the plagiarism detection programs.
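As a rough illustration of how results on such a database might be compared, here is a minimal Python sketch. It assumes each detection program reports a list of file pairs it flags as copied, and that DUPE records which pairs are known plagiarism. The function name, the file names, and the choice of precision and recall as the comparison metrics are all assumptions made for illustration; nothing here is specified by the DUPE proposal itself.

```python
def score_detector(reported_pairs, known_pairs):
    """Score one detector's output against known plagiarized pairs.

    Both arguments are iterables of (file_a, file_b) tuples. Pair order
    is ignored, so ("a.c", "b.c") and ("b.c", "a.c") count as the same pair.
    Returns (precision, recall). These metrics are an illustrative choice.
    """
    reported = {frozenset(pair) for pair in reported_pairs}
    known = {frozenset(pair) for pair in known_pairs}
    true_positives = len(reported & known)
    # Precision: fraction of flagged pairs that really are plagiarism.
    precision = true_positives / len(reported) if reported else 0.0
    # Recall: fraction of known plagiarized pairs that were flagged.
    recall = true_positives / len(known) if known else 0.0
    return precision, recall


# Hypothetical data: two known copies of orig.c, one found, one false alarm.
known = [("orig.c", "copy1.c"), ("orig.c", "copy2.c")]
reported = [("copy1.c", "orig.c"), ("orig.c", "other.c")]
precision, recall = score_detector(reported, known)
```

Running every detection program over the same set of known pairs and tabulating the two numbers would give one simple, tool-neutral basis for comparison.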

SAFE Corporation has begun looking into creating this database. However, we realize that we would like to work with partners in academia and industry. We believe that there are several key issues that need to be resolved in creating DUPE. These are:

  1. Choosing appropriate open source projects.
  2. Creating a minimum definition of software plagiarism.
  3. Creating the database.
  4. Determining policies including who can access it, how it will be used, and who will maintain it.
  5. Determining how to run the tests, how to generate the results, and how to distribute the results.

Please contact me if you’re interested in working on this important and groundbreaking project.

Interesting software IP cases of 2009

Here is my list of the most interesting software IP cases of 2009, in chronological order:

SAFE Corporation is looking for great ideas

There are a lot of unanswered questions about source code, and we want to work with you to figure them out. We realize that currently accepted algorithms for analyzing, comparing, and measuring source code leave a lot to be desired in many cases. Also, there are a lot of techniques that have never been studied on large bodies of modern code. For example, measurement techniques developed in the 1970s were probably tested on assembly languages and older programming languages like BASIC, FORTRAN, and COBOL. Do they still hold for modern object-oriented languages like Java and C#?

If you have a research idea relating to code analysis, and you can use the SAFE tools, let us know. Email Larry Melling, VP of Sales and Marketing, with your ideas. If they pass our review process you’ll get free licenses to our tools, free support, and help getting your results published. This could be the beginning of a beautiful friendship.

Software trade secrets

The precise language that legally defines a trade secret varies by jurisdiction, as do the particular types of information that are subject to trade secret protection. In the United States, different states have different trade secret laws. Most states have adopted the Uniform Trade Secrets Act, and those that have not have laws that differ only in subtle ways.

There are three factors that are common to all definitions; a trade secret always has these three specific characteristics:

  1. It is not generally known to the public.
  2. It confers some sort of economic benefit on its holder, where the benefit is due to the fact that it is not known to the public.
  3. The owner of the trade secret makes reasonable efforts to maintain its secrecy.

With regard to software trade secrets, algorithms that are known to the public usually cannot be trade secrets, though some jurisdictions require not only that the information be public but that it be “readily ascertainable,” meaning easy to find. For example, a sorting algorithm found in a well-known textbook or in an application note on a high-traffic website is, or can be, known to the public and easily ascertained.

There must be an economic benefit, so a sorting algorithm that can be easily replaced with a well-known sorting algorithm with comparable results is not a trade secret. Similarly, if your company develops a program, perhaps as a side project, but does not sell it or incorporate it in any products, then it’s not a trade secret.

If the owner of the source code allows programmers to share code, or does not put notices of confidentiality in the source code, or does not take reasonable steps to ensure that employees do not take the code home with them, then that source code cannot be a trade secret. This third point is a particularly important reason to take precautions to ensure your software does not go somewhere it shouldn’t. Make sure your employees, investors, and partners sign nondisclosure agreements (NDAs). Make sure you have written policies about how to handle source code. And make sure you treat all individuals and companies equally. You don’t want to be in court, defending a trade secret, and have to explain why one “trusted employee” or “trusted friend” was allowed to take home source code while others were not. That doesn’t look like “reasonable efforts to maintain secrecy.”

When is reverse engineering OK?

Copyrights protect expressions of ideas, but not the ideas themselves. Anyone can write about two young lovers from different families and different backgrounds and not fear getting sued by the estates of William Shakespeare or Arthur Laurents or anyone who writes daytime TV movies. It is for this reason that software can be reverse engineered to learn the ideas it embodies without violating the copyright, as long as the code is not copied and used commercially. The first court decision to enforce this idea was Atari Games v. Nintendo in September 1992.

Nintendo tightly controlled access to its successful NES video game system and did not release the specifications for creating a game cartridge for the system. In order to produce a game for the system, companies had to pay a license fee to Nintendo and had to agree not to produce the licensed game for any other game system for two years. Incorporated into the Nintendo NES system was a computer program called 10NES that checked whether a particular game had been licensed. If not, the game was not allowed to run. Atari reverse engineered the 10NES program and created its own program called Rabbit for bypassing 10NES. Atari sued Nintendo for, among other things, unfair competition and monopolistic practices. Nintendo countersued for, among other things, copyright infringement. The U.S. Court of Appeals for the Federal Circuit ruled that the reverse engineering was perfectly legal. It also ruled that Atari infringed on Nintendo’s copyright when Atari created its own program based on Nintendo’s program. The decision by Judge Randall Rader reads as follows:

The district court assumed that reverse engineering (intermediate copying) was copyright infringement… This court disagrees. Atari did not violate Nintendo’s copyright by deprocessing computer chips in Atari’s rightful possession. Atari could lawfully deprocess Nintendo’s 10NES chips to learn their unprotected ideas and processes. This fair use did not give Atari more than the right to understand the 10NES program and to distinguish the protected from the unprotected elements of the 10NES program. Any copying beyond that necessary to understand the 10NES program was infringement. Atari could not use reverse engineering as an excuse to exploit commercially or otherwise misappropriate protected expression.

Key points about software copyrights

First, a copyright exists at the moment of creation. In other words, a work does not need to be published to have a copyright. The copyright does not need to be registered with the U.S. Copyright Office. It is simply a right given to the person who created the work. The advantage of registering a copyright with the government is that you then have an official document proving your ownership, making it easier to win in court against someone who attempts to use your creation without your permission. Registration can be done any time after the work is created, but is required in order to initiate litigation. Winning a copyright infringement case in court, when the copyright is registered before the infringement took place or within 3 months of the publication of the work, can entitle you to get back your attorney fees as well as “statutory damages,” which essentially constitute financial punishment that is not based on the amount of money lost by the author due to the infringement. This is done to encourage people to register their copyrights and to deter people from stealing them.

As the owner of a copyright, you have the right to reproduce the work, enhance the work, distribute the work, and perform it or display it in public.

With software, the copyright gives protection to the source code and the binary code generated from the source code. In order to register a copyright, it is normally necessary to file a copy of the intellectual property being protected with the U.S. Copyright Office as proof. Since most software contains valuable trade secrets (which we discuss in a later section) that would lose their value if presented to the public, the Copyright Office allows software source code to be submitted with major sections “redacted” or left out. In fact, only the first 25 and last 25 printed pages of source code need to be submitted, though there are no guidelines as to what constitutes “first” and “last” in something consisting of many independent files and a complex interconnection of routines.

Note that a copyright notice is not required in the code, except for registering the copyright.

How much is your software worth?

My consulting company Zeidman Consulting worked on a large tax case last year. For reasons involving the labyrinthine regulations of the IRS, it was important to understand how much of the IP of a software program had changed from the time it was first developed ten years ago, through subsequent revisions, until the current version. In the current version, IP remaining from the first version was taxed at one rate while IP added subsequently was taxed at a different rate (this is a simplification based on my limited understanding of tax law). There was a lot of money at stake.

Previous methods of measuring code involved simply counting lines of code, which gives a very poor estimate. Consider an example where an entire function consisting of 10,000 lines of code is replaced with a more efficient function requiring only 9,000 lines of code. Simply counting lines would tell you that there was a net reduction of 1,000 lines of code, which could incorrectly be interpreted as a reduction in IP. We realized that we could use CodeDiff and FileCount to compare lines of code to find the number of lines of code that continue from one version to another, the number of lines of code that are changed, and the number of lines of code that are added. Plugging these values into a well-defined spreadsheet allows you to graph this measure of changing lines of code (“CLOC”) over time. The actual valuation of the initial version of the software is a complex process better left to financial analysts, but the CLOC method provides a great way to measure the changes in value.
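As a rough illustration of the counting step, here is a minimal Python sketch that classifies lines between two versions of a file as continuing, changed, added, or removed. It uses the standard library's difflib rather than CodeDiff or FileCount, and the function name and the rule for splitting a replaced block into "changed" versus "added" or "removed" lines are assumptions made for this example, not the published CLOC definitions.

```python
from difflib import SequenceMatcher


def count_line_changes(old_lines, new_lines):
    """Classify lines between two versions of a file.

    Assumed rule (not the published CLOC method): within a replaced
    block, lines are paired one-to-one as "changed"; any surplus on the
    new side is "added" and any surplus on the old side is "removed".
    """
    counts = {"continuing": 0, "changed": 0, "added": 0, "removed": 0}
    matcher = SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        old_n, new_n = i2 - i1, j2 - j1
        if tag == "equal":
            counts["continuing"] += old_n
        elif tag == "replace":
            paired = min(old_n, new_n)
            counts["changed"] += paired
            counts["added"] += new_n - paired
            counts["removed"] += old_n - paired
        elif tag == "insert":
            counts["added"] += new_n
        elif tag == "delete":
            counts["removed"] += old_n
    return counts


# Tiny hypothetical example: one line edited, one line appended.
old_version = ["a", "b", "c"]
new_version = ["a", "x", "c", "d"]
result = count_line_changes(old_version, new_version)
```

Under this assumed rule, the 10,000-line function rewritten as 9,000 entirely different lines would show up as 9,000 changed lines and 1,000 removed lines, rather than as a bare net loss of 1,000 lines.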

You can read more about CLOC in the article by Nik Baer and me in Intellectual Property Today entitled “Measuring Changes in Software IP,” including a measurement of the Mozilla Firefox open source project.