The history of the computer industry is filled with fascinating tales of sudden riches and lost opportunities. Take that of Ronald Wayne, who cofounded Apple Computer with Steve Wozniak and Steve Jobs but sold his shares for just US $2,300. And John Atanasoff, who proudly showed his digital computer design to John Mauchly who later codesigned the Eniac, typically recognized as the first electronic computer, without credit to Atanasoff. Perhaps the most famous story of missed fame and fortune is that of Gary Kildall. A pioneer in computer operating systems, Kildall started the company Digital Research and wrote Control Program for Microcomputers (CP/M), the operating system used on many of the early hobbyist personal computers, such as the MITS Altair 8800, the IMSAI 8080, and the Osborne 1, before IBM introduced its own PC. Kildall could have been the king of personal computer software, but instead that title went to his small-time rival Bill Gates. For years, rumors have circulated that the code for the original DOS operating system sold by Microsoft is actually copied from the CP/M operating system developed by Digital Research.
A couple years ago we took it upon ourselves to search out the original code and use CodeSuite to determine the truth once and for all. Our research was summarized in a popular (and not-so-popular) article in IEEE Spectrum entitled Did Bill Gates Steal the Heart of DOS? If you haven’t read it, you should. It’s a fun read but it only summarizes our exhaustive results using our tools and procedures for finding copied code. The article generated a lot of controversy and we always intended to publish the full technical details of our analysis, but it’s surprising how many people don’t like our conclusion and wouldn’t publish my paper. But now the full academic paper entitled A Code Correlation Comparison of the DOS and CP/M Operating Systems is available online in the Journal of Software Engineering and Applications. If you want to know the details, and you want to know the truth, it’s in the article and the details are in the paper.
I hope you’re all aware of my book The Software IP Detective’s Handbook: Measurement, Comparison, and Infringement Detection. It’s the first book on Software Forensics, a field that I pioneered at Software Analysis and Forensic Engineering and Zeidman Consulting. Whereas Digital Forensics deals with bits and files, without any detailed knowledge of the meaning of the data, Software Forensics deals with analysis of software using detailed knowledge of its syntax and functionality to perform analysis to find stolen code and stolen trade secrets. The algorithms described in the book have been used in many court cases. The book also describes algorithms for measuring software evolution, particularly as it relates to IP changes.
If you are a teacher, this is a great time to incorporate the materials in the book into your courses on software development, intellectual property law, business management, and computer science. There’s something for everyone in the various chapters of the book. Your students and you will be at the forefront of an important and very new field of study.
If you’re interested, please contact me.
S.A.F.E. recently released version 4.4 of CodeSuite and version 1.1 of CodeSuite-LT. The most important new feature of this version is that these programs now recognizes many different text encoding formats including ASCII, UTF-8, UTF-16, and UTF-32. Characters in alphabets other than the Latin alphabet used for English are now supported. For example, code with comments or strings in Japanese, Korean, Chinese, or Russian can be compared correctly.
The most significant change is to BitMatch. When examining binary object code to find text strings, you can now specify the encoding format of the file. If you’re not sure about the encoding, you can choose multiple formats.
As demand for our products increase outside the United States, we realized a need to support languages in those countries also.
CodeSuite-LT® is a less expensive, limited version of the full CodeSuite tool. Each tool in the suite produces a readable report that can be used to find copying. CodeSuite-LT includes BitMatch, CodeCross, CodeDiff, CodeMatch, FileCount, and FileIsolate. It also includes the ability to filter results using SourceDetective. CodeSuite-LT does not produce a database and does not allow post-process filtering of results. Instead, it generates an easy-to-read report that can be used to pinpoint copying.
Which is Right For You?
Which product is right for you, CodeSuite or CodeSuite-LT? Click here for a table that compares the features of both programs so you can choose the right solution.
Marc Dreier was the founding partner of the New York law firm Dreier Stein and Kahan. At its peak in 2009 the firm was reportedly bringing in about $100 million in revenue. You might have heard of Marc Dreier if he had not been overshadowed by Bernie Madoff. But Dreier ran his own Ponzi scheme, cheating clients out of “only” $400 million. Prosecutors asked for a sentence of 145 years. Defense attorney requested a 20 year sentence. He got the shorter sentence.
Maybe you’ve wondered, like I have, what’s going through the minds of these people. Do they think they won’t get caught? Do they not care? Maybe they figure they’ll live it up while they can? Dreier’s letter to the judge in the case, prior to sentencing (downloadable here) might give some insight into Dreier’s state of mind. Some say this was just a way to get sympathy from the court (in which case it appears to have worked). To me it seems sincere. It definitely doesn’t excuse Dreier’s behavior, but it does possibly explain how such a successful man could end up in his situation and what was going on in his mind.
Over the years I’ve run into expert witnesses and attorneys who have told me about software copyright infringement cases where the only clues that copying occurred were patterns of spaces and tabs (“whitespace”). The idea is that if a truly ambitious thief wanted to cover his tracks, he would modify the stolen code so much that there was no longer a visible trace of copying. However, the clever software sleuth could find patterns of whitespace that the thief had missed; although virtually nothing remained, the invisible tabs and spaces could produce a conviction.
This always sounded intriguing, but I wondered whether anyone had ever tested this theory. We could find no articles or papers on the subject, except for one inconclusive paper, and I dreaded to think that some programmer was convicted based on an untested theory. I decided to have my consulting company, Zeidman Consulting, do some carefully controlled research. If the results turned out well, SAFE Corporation would add whitespace pattern algorithms to CodeSuite to further enhance its ability to detect copying.
Our results were published in a paper entitled Measuring Whitespace Patterns as an Indication of Plagiarism that was recently presented at the ADFSL Conference on Digital Forensics, Security and Law. Our results are summarized in the final paragraph:
This whitespace pattern matching method can be used to focus a search for evidence of similarity or copying, but this method cannot stand by itself.
What we discovered is that even very different files have often have similar whitespace patterns. At Zeidman Consulting we’ve used whitespace patterns to confirm copying that was already detected through the use of CodeMatch to find correlated programming elements. In those cases, the whitespace patterns offered further confidence in our findings and in some cases showed which program had been developed first. For a copy of the paper, email us at info@SAFE-corp.biz.
Our next research project is to look at sequences of whitespace within files. Maybe there we’ll find some clues to copying. But for now our results show that whitespace patterns without any other evidence should not be used to determine that copying occurred.
Forrester Consulting just put out a report that I found interesting. According to Forrester, chief information security officers (CISOs) face increasing demands from their business units, regulators, and business partners to safeguard their information assets. Security programs protect two types of data: secrets that confer long-term competitive advantage and custodial data assets that they are compelled to protect. Secrets include product plans, earnings forecasts, and trade secrets; custodial data includes customer, medical, and payment card information that becomes “toxic” when spilled or stolen. Forrester found that enterprises are overly focused on compliance and not focused enough on protecting their secrets. Forrester’s key findings are the following:
- Secrets comprise two-thirds of the value of firms’ information portfolios.
- Compliance, not security, drives security budgets.
- Firms focus on preventing accidents, but theft is where the money is.
- The more valuable a firm’s information, the more incidents it will have.
- CISOs do not know how effective their security controls actually are.
Download the report to report to get the details.
In 2003 I created the CodeMatch program that very quickly became a de facto standard in software IP litigation. I created a test bench of purposely plagiarized code that could be used to independently and objectively compare the results produced by different plagiarism detection programs. Some in the academic community claimed that my tests were biased toward the algorithms used by CodeMatch, which explained why CodeMatch fared so well compared to the other programs. However, these same critics, despite my requests, never produced their own set of standard tests.
Although I believe that the standard tests I have used are not biased, it occurred to me that there could be a better way to eliminate even unintentional bias. The solution would be to take the source code for certain open source programs and announce a new open source project that would involve purposely plagiarizing the code. Programmers from around the world would be invited, perhaps in a competition, to change the source code while retaining the functionality. The original programs and the plagiarized versions submitted from others would be stored in a database known as the Depository of Universal Plagiarism Examples or DUPE. Plagiarism detection programs would then be run on DUPE and comparisons of the results could be made to determine which programs best detected copying. Also, important statistics about plagiarized code could be determined, as well as patterns identified in order to improve the plagiarism detection programs.
SAFE Corporation has begun looking into creating this database. However, we realize that we would like to work with partners in academia and industry. We believe that there are several key issues that need to be resolved in creating DUPE. These are:
- Choosing appropriate open source projects.
- Creating a minimum definition of software plagiarism.
- Creating the database.
- Determining policies including who can access it, how it will be used, and who will maintain it.
- Determining how to run the tests, how to generate the results, and how to distribute the results.
Please contact me if you’re interested in working on this important and groundbreaking project.
There are a lot of unanswered questions about source code, and we want to work with you to figure them out. We realize that currently accepted algorithms for analyzing, comparing, and measuring source code leave a lot to be desired in many cases. Also, there are a lot of techniques that have never been studied on large bodies of modern code. For example, measurement techniques developed in the 1970s were probably tested on assembly languages and older programming languages like BASIC, FORTRAN, and COBOL. Do they still hold on modern object oriented languages like Java and C#?
If you have a research idea relating to code analysis, and you can use the SAFE tools, let us know. Email Larry Melling, VP of Sales and Marketing with your ideas. If they pass our review process you’ll get free licenses to our tools, free support, and help getting your results published. This could be the beginning of a beautiful friendship.
Last month I talked about a report from McAfee, Inc. that discussed the huge amount of intellectual property that gets stolen from companies. A new report from the Ponemon Institute confirms this data. According to this report, more than half of workers that are let go from their employers take confidential data and intellectual property with them as they head out the door.
Here are some interesting statistics from the report (we all love statistics):
- 945 individuals who were laid off, fired or quit their jobs in the past 12 months were surveyed.
- 59% admitted to stealing company data.
- 67% used their former company’s confidential information to help get a new job.
- 61% of respondents who disliked their company took data.
- 26% of those who liked their company still took data.
- 79% of those who took data rationalized it rather than call it wrong.
- 24% claimed to still have access to their former employer’s computers after they left.
For more information you can read the Network World article.