Software Analysis and Forensic Engineering has just released a new version of CodeSuite that has some really great new features.
What’s a PID? It’s a partial identifier. Or more specifically, a partially matching identifier. That’s where two identifiers in code almost match. So for example, the identifiers identifier1 and confident_boy share the partial identifier (or “PID”) ident. CodeMatch has always been able to correlate PIDs and use them in calculating the identifier correlation score, which is one component of the overall correlation score between two source code files. But there can be so many PIDs that users got bleary-eyed trying to view them all and find suspicious ones in a CodeMatch HTML report. So we came up with a solution. You can now export the PIDs from a CodeSuite database into a spreadsheet. You can see not only the PIDs, but also the original identifiers that share them. Now you can sort and select, cut and paste, and generally look for clues to copying in a simple spreadsheet.
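To make the idea of a partial identifier concrete, here is a minimal Python sketch. It assumes a PID is simply the longest substring two identifiers have in common, ignoring case, and that only substrings of at least four characters count. Both assumptions, along with the function names, are mine for illustration; this is not how CodeMatch itself finds PIDs.

```python
from itertools import combinations


def longest_common_substring(a: str, b: str) -> str:
    """Return the longest run of characters that appears in both a and b."""
    best_len, best_end = 0, 0
    # prev[j] holds the length of the common suffix of a[:i-1] and b[:j].
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def find_pids(identifiers, min_length=4):
    """Yield (pid, first, second) for identifier pairs that partially match."""
    for first, second in combinations(identifiers, 2):
        if first.lower() == second.lower():
            continue  # exact matches are not partial matches
        pid = longest_common_substring(first.lower(), second.lower())
        if len(pid) >= min_length:  # min_length is an assumed threshold
            yield pid, first, second
```

Calling find_pids on the two identifiers from the example above yields the PID ident together with both original identifiers, which is the kind of row you would then sort and filter in a spreadsheet.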
Part of our process for finding copying has been to first find all the source code files in a directory of files so that you know what to examine. However, there are lots of source code files, and some can be missed. Some programming languages are a bit uncommon and you may not recognize the source code files. Well, we found a solution to that too. The new FileIdentify function of CodeSuite allows you to point at a folder and generate a spreadsheet containing all of the file extensions in that folder and all subfolders. If CodeSuite recognizes the (potential) programming language, it will put that information in the spreadsheet too.
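For readers who want to see the general idea, the following Python sketch walks a folder and its subfolders, counts the file extensions it finds, and writes them to a CSV file that opens in any spreadsheet. The LANGUAGE_HINTS table, the column names, and the function names are hypothetical; this is an illustration of the approach, not the FileIdentify implementation.

```python
import csv
import os
from collections import Counter

# Hypothetical extension-to-language hints; a real table would be far larger.
LANGUAGE_HINTS = {
    ".c": "C",
    ".h": "C/C++ header",
    ".java": "Java",
    ".py": "Python",
    ".cbl": "COBOL",
}


def count_extensions(root):
    """Count files by lowercased extension under root and all subfolders."""
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower()
            counts[ext or "(none)"] += 1
    return counts


def write_report(counts, csv_path):
    """Write one row per extension: extension, file count, possible language."""
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["extension", "files", "possible_language"])
        for ext, total in sorted(counts.items()):
            writer.writerow([ext, total, LANGUAGE_HINTS.get(ext, "")])
```

Extensions with no entry in the table get an empty language column, which makes unfamiliar file types easy to spot and investigate by hand.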
From the beginning of CodeSuite, when there was only CodeMatch, the database has always been a fully documented text file that anyone can view. This allows our customers to make their own tools to extract data and statistics from a CodeSuite comparison, and some customers have created some very interesting utilities. Our database format was simple, but grew more complex over the years. Now we have a function in CodeSuite that converts any CodeSuite database into XML so that you can use off-the-shelf tools to examine it, translate it, or write utilities to extract data and statistics.
I hope you’re all aware of my book The Software IP Detective’s Handbook: Measurement, Comparison, and Infringement Detection. It’s the first book on Software Forensics, a field that I pioneered at Software Analysis and Forensic Engineering and Zeidman Consulting. Whereas Digital Forensics deals with bits and files, without any detailed knowledge of the meaning of the data, Software Forensics analyzes software using detailed knowledge of its syntax and functionality to find stolen code and stolen trade secrets. The algorithms described in the book have been used in many court cases. The book also describes algorithms for measuring software evolution, particularly as it relates to IP changes.
If you are a teacher, this is a great time to incorporate the materials in the book into your courses on software development, intellectual property law, business management, and computer science. There’s something for everyone in the various chapters of the book. Your students and you will be at the forefront of an important and very new field of study.
If you’re interested, please contact me.
My book on software intellectual property, a labor of love (and hate) for the last two years, has just been published by Prentice-Hall. The book is intended for several different audiences including computer scientists, computer programmers, business managers, lawyers, engineering consultants, expert witnesses, and high-tech entrepreneurs. Some chapters give easy-to-understand explanations of intellectual property concepts including copyrights, patents, and trade secrets. Other chapters are highly mathematical treatments describing quantitative ways of comparing and measuring software and software IP. The first chapter of the book outlines which chapters are most important for the different audiences.
Overall the book covers the following topics:
- Key concepts of software intellectual property
- Comparing and correlating source code for signs of theft or infringement
- Uncovering signs of copying in object code when source code is inaccessible
- Tracking malware and third-party code in applications
- Using software clean rooms to avoid IP infringement
- Understanding IP issues associated with patents, open source, and DMCA
You can purchase your copy from Amazon.com here.
You can now run CodeMeasure to graph the growth of your software project development effort over multiple versions of the software. CodeMeasure uses the Changing Lines of Code (CLOC) method to calculate the growth. The graph that CodeMeasure produces illustrates various CLOC measurements. An example is shown below.
Now there is a caveat (we do need to make a profit after all). You can examine the graph and take a screenshot of it, but you can’t save the results to a spreadsheet without a paid license. The good news is that a license is only $500 for a 1-year unlimited license. You can download CodeMeasure here and purchase a license here. This way you get to try out CodeMeasure and see how the results can help you measure your software development effort.
So the government is finding ways to fix the patent system. One of those fixes is the Peer-to-Patent program. It seems like a good idea. In order to speed up the granting of good patents and quickly eliminate the bad ones, allow people from everywhere and anywhere to submit prior art. If that’s actually the way it worked, I’d celebrate; it would be a great resource for finding prior art and making the patent office more efficient. Unfortunately my experience is that the program creates more problems than it fixes. The patent office invited me to participate in the program. Two people posted “invalidating prior art” for my patent application entitled “Detecting Plagiarism in Computer Source Code.” This art was related to my invention, but definitely was not invalidating. Here is the first independent claim of my original patent application:
- A computer-implemented method comprising:
- creating, by a computer system, a first array of lines of functional program code from a first program source code file, the first program source code file including the lines of functional program code of a first program and lines of nonfunctional comments of the first program;
- creating, by the computer system, a second array of lines of nonfunctional comments from a second program source code file, the second program source code file including lines of functional program code of a second program and the lines of nonfunctional comments of the second program;
- comparing, by the computer system, the lines of functional program code from the first array with the lines of nonfunctional comments from the second array to find similar lines;
- calculating, by the computer system, a similarity number based on the similar lines; and presenting to a user an indication of copying of the first program source code file wherein said indication of copying is defined by the similarity number.
Here is the only independent claim of the prior art patent US 7,568,109:
- A system for comparing at least a first corpus to a second corpus, comprising:
- an analyzer identifying concepts in the corpuses, said analyzer determining a frequency rating of each of said concepts in each corpus;
- for each corpus, replacing each instance of each of said concepts with its respective determined frequency rating to create a frequency file;
- and a comparator comparing the frequency file for the first corpus to the frequency file for the second corpus, wherein said comparing the frequency file for the first corpus to the frequency file for the second corpus further comprises comparing portions of one corpus against the other corpus.
The second prior art submission was simply a reference to the UNIX diff command. While the diff command is relevant, it is a simple line-by-line comparison of text files without any understanding or parsing of programming source code. It doesn’t separate functional lines of code (statements) from nonfunctional lines (comments).
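To show what separating statements from comments looks like in practice, here is a small Python sketch. It makes several simplifying assumptions that are mine, not the patent's: it recognizes only // comments, ignores block comments and // inside string literals, treats lines as matching only when they are identical after whitespace normalization, and uses the fraction of matched lines as the similarity number. It is an illustration of the distinction, not the claimed method.

```python
def split_lines(source):
    """Split source text into (code_lines, comment_lines), whitespace-normalized."""
    code, comments = [], []
    for raw in source.splitlines():
        # Assumption: everything after the first "//" on a line is a comment.
        statement, marker, comment = raw.partition("//")
        statement = " ".join(statement.split())
        comment = " ".join(comment.split())
        if statement:
            code.append(statement)
        if marker and comment:
            comments.append(comment)
    return code, comments


def code_to_comment_similarity(first_source, second_source):
    """Fraction of the first file's code lines found among the second file's comments."""
    first_code, _ = split_lines(first_source)
    _, second_comments = split_lines(second_source)
    if not first_code:
        return 0.0
    commented = set(second_comments)
    matches = sum(1 for line in first_code if line in commented)
    return matches / len(first_code)
```

A plain diff of the two files would treat a statement and the same statement commented out as different lines. Splitting first lets the comparison catch code from one program that survives only as commented-out text in another.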
Judging by their remarks, the posters to the Peer-to-Patent site didn’t understand patents, and didn’t read the patent claims. They should be allowed to post references, but the ultimate decision must be in the hands of those trained in examining patents. However, the patent examiner told me that her supervisor didn’t want to issue a patent that had been publicly noted to be invalid, and so after months of arguments I had to arbitrarily narrow the claims to get allowance, resulting in patent US 7,823,127. So now, anyone from anywhere with any ulterior motive (particularly those who believe no software should be patentable) can bring about the quick rejection of an otherwise useful and valid patent.
Last month we announced CodeMeasure, our new standalone tool for measuring software growth. This month we announced the release of CodeSuite 4.0 that includes CodeCLOC for measuring how software evolves across versions of code. CodeCLOC uses the same algorithms that were implemented in CodeMeasure and that were developed for the landmark software transfer pricing case Symantec v. Commissioner of Internal Revenue.
You’re probably wondering what the difference is between CodeMeasure and CodeCLOC. CodeMeasure is a simple, inexpensive program for generating the CLOC measurement statistics for multiple versions of a program. CodeCLOC, intended for litigation, compares only two versions of code but produces a detailed database of results that can be further filtered and analyzed using CodeSuite or your own custom tools. The results from CodeCLOC can be presented in court and the CodeCLOC database can be presented to the opposing party for verification.
CodeSuite 4.0 also has a few other nice features including a revamped user interface. There’s also a new function to generate statistics from any CodeSuite database, and the command line interface has been enhanced for integrating with other programs. CodeSuite 4.0 is available for download here and can be purchased on a term license or project basis. CodeCLOC is priced at $20 per megabyte. A one-year term license for CodeSuite is $100,000.
SAFE has just introduced its latest product called CodeMeasure™ that can measure the growth of software. Unlike our other products, this one is intended for software developers (look for a litigation version coming soon to CodeSuite). The tool is based on the technique that Zeidman Consulting developed for the case Symantec v. IRS that we call the Changing Lines of Code (CLOC) method of measuring software changes. It worked pretty well in the Symantec case to help calculate software transfer pricing, and saved Symantec over $500 million in taxes.
We have a whole new website about the product, designed for software developers, at CodeMeasure.com. Check it out and let me know what you think of the product and the website.
There are a lot of unanswered questions about source code, and we want to work with you to figure them out. We realize that currently accepted algorithms for analyzing, comparing, and measuring source code leave a lot to be desired in many cases. Also, there are a lot of techniques that have never been studied on large bodies of modern code. For example, measurement techniques developed in the 1970s were probably tested on assembly languages and older programming languages like BASIC, FORTRAN, and COBOL. Do they still hold on modern object-oriented languages like Java and C#?
If you have a research idea relating to code analysis, and you can use the SAFE tools, let us know. Email Larry Melling, VP of Sales and Marketing, with your ideas. If they pass our review process you’ll get free licenses to our tools, free support, and help getting your results published. This could be the beginning of a beautiful friendship.
My consulting company Zeidman Consulting worked on a large tax case last year. For reasons involving the labyrinthine regulations of the IRS, it was important to understand how much of the IP of a software program had changed from the time it was first developed ten years ago, through subsequent revisions, until the current version. In the current version, IP remaining from the first version was taxed at one rate while IP added subsequently was taxed at a different rate (this is a simplification based on my limited understanding of tax law). There was a lot of money at stake.
Previous methods of measuring code involve counting lines of code. However, that’s a very poor estimate. Consider an example where an entire function consisting of 10,000 lines of code is replaced with a more efficient function requiring only 9,000 lines of code. Simply counting lines would tell you that there was a net reduction of 1,000 lines of code, which could incorrectly be interpreted as a reduction in IP. We realized that we could use CodeDiff and FileCount to compare lines of code to find the number of lines of code that continue from one version to another, the number of lines of code that are changed, and the number of lines of code that are added. Plugging these values into a well-defined spreadsheet allows you to graph this measure of changing lines of code (“CLOC”) over time. The actual valuation of the initial version of the software is a complex process better left to financial analysts, but the CLOC method provides a great way to measure the changes in value.
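As a rough illustration of sorting lines into continuing, changed, and added, here is a Python sketch that uses the standard difflib module in place of CodeDiff and FileCount. The way it pairs up rewritten lines, and the extra "removed" category, are my assumptions for this example. The actual CLOC calculations are described in the article and book, not here.

```python
import difflib


def classify_lines(old_lines, new_lines):
    """Count continuing, changed, added, and removed lines between two versions."""
    counts = {"continuing": 0, "changed": 0, "added": 0, "removed": 0}
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        old_n, new_n = i2 - i1, j2 - j1
        if tag == "equal":
            counts["continuing"] += new_n
        elif tag == "replace":
            # Pair up rewritten lines; any surplus is a pure add or remove.
            counts["changed"] += min(old_n, new_n)
            counts["added"] += max(new_n - old_n, 0)
            counts["removed"] += max(old_n - new_n, 0)
        elif tag == "insert":
            counts["added"] += new_n
        elif tag == "delete":
            counts["removed"] += old_n
    return counts
```

Running this on each pair of consecutive versions gives one row of counts per release, which is the kind of table you could then graph over time. Unlike a raw line count, a rewritten function shows up as changed lines rather than as a net loss.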
You can read more about CLOC in the article by Nik Baer and me in Intellectual Property Today entitled “Measuring Changes in Software IP,” which includes a measurement of the Mozilla Firefox open source project.