Principles to standardize Performance Metrics: the missing element in AI systems applied to patents

Highlights

 

Artificial Intelligence is addressing patent complexity
Monetary, regulatory and judiciary stakes are high
Current players provide inconsistent (and often biased) measures to prove relevance
This paper provides recommendations for Performance Metrics to promote an open and fair decision-making environment and save costs

 

 

Patents are complex objects immersed in an ever-evolving judicial environment. For companies, they carry high strategic stakes, such as increasing the ROI of R&D or entering new markets.

It is only natural that Artificial Intelligence (AI) is tackling this challenge. A variety of players (universities and start-ups, but also in-house teams, incumbents, and AI behemoths) are applying Machine Learning (ML) techniques to complex patent-related problems such as:

Drafting patents,

Identifying validity concerns,

Assessing essentiality to standard specifications,

Assessing infringement and patent value… not to mention inventing itself.

Each of those matters has multi-billion dollar implications.

 

The results, and the reactions to them, have been wide-ranging so far: from worship to skepticism, by way of laughter and disdain.

Yet money continues to be poured into these projects, and countries as diverse as China, Estonia, and Canada are experimenting with AI-aided dispute resolution.

 

This paper provides a set of key principles to help parties, courts, AI providers and their investors assess performance in order to deliver an open and fair decision-making environment. This approach can ultimately gain traction within private and public dispute resolution mechanisms and save billions in market-clearing overheads.

 

Exponential technology, exponential results

AI tools find correlations that the human brain struggles to make, by harnessing enormous amounts of computing power and data. Of course, human expertise is always needed to design, train, and validate these tools. However, there is no reason to believe that the patent world will resist AI any longer than more complex fields such as neuroscience or autonomous driving have.

 

Problems in the patent world

Beyond the usual limits of AI such as incomplete or biased data, overfitting, and underfitting, AI in the patent world comes with additional concerns:

 

Scarcity of data

Invalidation proceedings such as those before the PTAB provide only a handful of data points: around 15,000 AIA petitions to date, against 3.3 million granted US patents. For reference, translation engines are trained on many millions of data points.

Moreover, most transactional data sit behind confidentiality walls.

 

High costs of human validation

Prices for a single patent analysis run from USD 1k to USD 50k, which severely constrains the human validation of results at scale.

 

An English language bias

Although high-stakes disputes take place in multiple jurisdictions, and China outnumbers the US in patent court cases, most tools today simply ignore Chinese data points. Admittedly, the progress of machine translation and the fair accessibility of Chinese cases may well mitigate this problem.

 

AI systems as black boxes

Machine Learning (ML) engineers and scientists use extremely complex and ever-evolving tools. They themselves struggle to understand why some methods work better than others, and they run the risk of losing their audience with opaque names and odd acronyms.

 

Of course, basic transparency and explainability are needed to fulfill upcoming regulations (cf. the EU's proposed AI Regulation). But a thorough understanding of the respective AI approaches is of limited interest and diverts from the only valuable objective: performance.

 

For example: Do you care whether the latest Tesla uses lithium-iron-phosphate chemistry rather than nickel-cobalt-aluminum? Likely not. You will only focus on Real Range (km) and, maybe, Efficiency (Wh/km).

 

Vendors usually single out one parameter (their best one) to market how good they are, leaving users without a fair and consistent method to assess the relative performance of competing tools.

 

Hence, the only rational way to discuss AI tools without going under the hood is a set of transparent performance metrics that can be easily understood and applied.

 

Performance Metrics – using a Principles-based approach

Performance metrics quantify the accuracy of a system's predictions against known results.

 

The cost and variability of human validation are such that performance can only be measured by automated testing. It is important not to confuse training with testing: AI systems learn, unlearn, and relearn. Training is a circular process, and every system re-informs its input data by updating sources (such as new decisions) or extending to new ones (such as non-patent literature, NPL).

 

Testing, although carried out in-house by system providers, should also be conducted independently to measure performance.

In the field of AI as applied to patents, testing answers simple questions: Is this patent likely to be invalidated? Is this patent essential to a standard? Is this patent infringed by a particular product?

 

Testing is an art, and it should be guided by the following principles:

 

Principle 1: Open, fair and transparent

Goal: Fulfill the core requirements of open justice, procedural fairness, and impartiality.

How: Ensure the testing process is transparent and consistently applied. After use, testing data sets should be shared for cross-checking, peer review, and even crowd wisdom.

 

Principle 2: Non-correlation with training data

Goal: Ensure the relevance of the measures: if a testing set is correlated with or overlaps the training set, the metrics are erroneous.

How: Buy or build unique testing sets. Size matters, but a well-designed set of just 100 data points can reveal appalling performance. Do not reuse the same testing sets, to prevent players from gaming the system (remember patentees citing their own patents when forward citations drove most patent quality measures?).
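By way of illustration, here is a minimal Python sketch of such an overlap check. The field names (publication_number, family_id) are hypothetical, not taken from any particular vendor's schema. Checking at the family level matters because a sibling of a training patent shares claims and prior art, so it inflates metrics just as badly as an exact duplicate.

```python
# Minimal sketch: verify that a candidate testing set does not overlap
# with the training set, at both the patent and the patent-family level.
# The field names ("publication_number", "family_id") are illustrative
# assumptions, not a reference to any specific vendor's schema.

def check_independence(training_set: list[dict], testing_set: list[dict]) -> None:
    train_pubs = {p["publication_number"] for p in training_set}
    train_families = {p["family_id"] for p in training_set}

    leaked_pubs = [p for p in testing_set if p["publication_number"] in train_pubs]
    # Family-level leakage is subtler: a sibling of a training patent
    # shares claims and prior art, so it inflates metrics just as badly.
    leaked_families = [p for p in testing_set if p["family_id"] in train_families]

    print(f"Exact duplicates: {len(leaked_pubs)}")
    print(f"Same-family leaks: {len(leaked_families)}")
    assert not leaked_pubs and not leaked_families, "Testing set is correlated with training data"
```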

 

Principle 3: Correlation to the target environment

Goal: Ensure performance metrics apply to the target environment (think pharma vs. ICT patents, for example).

How: Make sure testing sets are close to the field of use.
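As a rough illustration, one could compare the technology mix of the testing set with that of the target environment, for instance via CPC sections. The sketch below is an assumption-laden simplification: it reads a hypothetical "cpc" field and uses total variation distance as the gap measure.

```python
# Minimal sketch: compare the technology mix of a testing set against the
# target environment using CPC sections. The "cpc" field name and the
# distance measure are illustrative assumptions.
from collections import Counter

def cpc_profile(patents: list[dict]) -> dict[str, float]:
    # Keep only the CPC section letter (A..H, Y) of each patent's classification.
    sections = Counter(p["cpc"][0] for p in patents)
    total = sum(sections.values())
    return {s: n / total for s, n in sections.items()}

def profile_distance(testing_set: list[dict], target_portfolio: list[dict]) -> float:
    test_p, target_p = cpc_profile(testing_set), cpc_profile(target_portfolio)
    keys = set(test_p) | set(target_p)
    # Total variation distance: 0 = identical technology mix, 1 = disjoint mix.
    return 0.5 * sum(abs(test_p.get(k, 0) - target_p.get(k, 0)) for k in keys)
```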

 

Principle 4: Simplicity

Goal: Ensure straightforward understanding by non-AI experts.

Examples: Recall, Precision, or their weighted harmonic mean Fbeta are gaining acceptance (see Appendix).

 

In other words: if you choose Real Range and Efficiency to assess your next electric car, take a test drive that matches your environment (city commute vs. long haul, for example) and measure real-life results.

 

Conclusion

Experts, as well as lawyers and judges, will always be highly valuable. But let's recognize that the complexity of patents, compounded by the complexity of legal proceedings, carries a huge transactional cost (probably north of USD 50 billion worldwide). This hinders the liquidity of the market for innovations and the quick adoption of standards.

Establishing principles-based performance metrics can mitigate known weaknesses of AI, promote wide acceptance and ultimately lower market frictions.

 

 

 

Immediate Action items

As a user of, or investor in, AI systems, insist on simple performance metrics (my preference goes to the F1 score) based on an independent data set.

 

As a developer of AI systems, welcome the above approach (and use the results to retrain).

 

 

 

Pascal Asselot is the Managing Partner of Vulnerant, an IP advisory boutique delivering unconventional strategies to leverage patents, AI and capital sources. Pascal has 18 years of experience in patent licensing discussions (both amicable and less amicable) and is a WIPO-certified mediator developing fair testing methodologies and advocating data-driven dispute resolution mechanisms to solve complex multijurisdictional patent disputes.

 

Appendix

Going (a little) deeper – Technical considerations


 

Alongside AI systems, the industry has developed techniques to score their performance. Common measures include:

Precision: the ratio of True Positives over all Retrieved Positives, i.e. TP / (TP + FP). It answers the question "How many retrieved items are relevant?"; in validity terms: "How many patents identified as valid are actually valid?"; in SEP terms: "How many identified SEPs are actually SEPs?"

Recall: the ratio of True Positives over all actual Positives, i.e. TP / (TP + FN). It answers the question "How many relevant items were retrieved?"; in validity terms: "How many of the valid patents (of the testing set) were identified as valid?"; in SEP terms: "How many of the essential patents (of the testing set) were identified as essential?"

The Fbeta score: the harmonic mean of Precision and Recall, weighted by beta (typically with values between 0.5 and 2; beta = 1 gives the widely used F1 score). A minimal computation is sketched below.
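For concreteness, here is a minimal Python sketch of the three measures, computed from binary outcomes. The labels in the usage example are invented purely for illustration.

```python
# Minimal sketch of Precision, Recall, and Fbeta from binary outcomes.
# y_true: ground truth from the testing set (e.g., 1 = held essential),
# y_pred: the engine's binary prediction.

def precision_recall_fbeta(y_true: list[int], y_pred: list[int], beta: float = 1.0):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Weighted harmonic mean; beta > 1 favors recall, beta < 1 favors precision.
    b2 = beta * beta
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall) if precision + recall else 0.0
    return precision, recall, fbeta

# Example: 10 patents, invented ground truth vs. an engine's calls.
p, r, f1 = precision_recall_fbeta([1,1,1,0,0,0,1,0,1,0], [1,0,1,0,1,0,1,0,1,1])
print(f"Precision={p:.2f}  Recall={r:.2f}  F1={f1:.2f}")  # 0.67, 0.80, 0.73
```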

Most current predictions (likelihood of being essential to a specific Technical Specification of a standard, likelihood of surviving an invalidation proceeding) are expressed as a score from 1 to 100, or as a Low/Med/High ranking. Precision and Recall, however, require binary outcomes (True or False). Techniques that let the decision threshold vary can bridge the gap and still yield simple metrics. These methods, although strangely named (Area Under the Curve, Receiver Operating Characteristic…), are simple and widely accepted.
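As an illustration, the sketch below (using scikit-learn, with invented scores and labels) shows both routes: a threshold-free AUC, and a fixed threshold that converts a 1-100 score into binary calls for Precision/Recall/Fbeta.

```python
# Minimal sketch: turning a 1-100 likelihood score into binary calls and a
# threshold-free AUC, using scikit-learn. Scores and labels are made up.
from sklearn.metrics import roc_auc_score, fbeta_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]          # testing-set ground truth
scores = [88, 35, 72, 60, 75, 20, 95, 41]  # engine output, 1-100

# AUC summarizes performance across all possible thresholds at once.
print("AUC:", roc_auc_score(y_true, scores))  # 0.875 on this toy data

# Alternatively, fix a threshold and fall back to Precision/Recall/Fbeta.
threshold = 50
y_pred = [1 if s >= threshold else 0 for s in scores]
print("F1 at threshold 50:", fbeta_score(y_true, y_pred, beta=1.0))  # ~0.89
```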

 

Examples

Comparing the accuracy of two Validity Score engines, broken down by patent geography

 

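A comparison of this kind can be computed in a few lines. The sketch below uses pandas and scikit-learn, with fabricated records for illustration only: the "truth" column stands for the outcomes of real invalidation proceedings, and the engine columns for each vendor's binary calls.

```python
# Minimal sketch: F1 of two validity-score engines, broken down by patent
# geography. All records below are fabricated for illustration only.
import pandas as pd
from sklearn.metrics import f1_score

results = pd.DataFrame({
    "geography": ["US", "US", "US", "EP", "EP", "EP", "CN", "CN", "CN"],
    "truth":    [1, 0, 1, 1, 0, 1, 1, 0, 1],   # outcomes of real proceedings
    "engine_a": [1, 0, 1, 1, 1, 1, 1, 0, 0],   # engine A's binary calls
    "engine_b": [1, 1, 0, 1, 0, 1, 1, 0, 1],   # engine B's binary calls
})

for geo, grp in results.groupby("geography"):
    f1_a = f1_score(grp["truth"], grp["engine_a"])
    f1_b = f1_score(grp["truth"], grp["engine_b"])
    print(f"{geo}: engine A F1={f1_a:.2f}, engine B F1={f1_b:.2f}")
```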

 

Where to find testing data - some practical ideas

Validity. Prosecution data is available at patent offices around the world and presents a wealth of data points on prior art.

Standard Essentiality. Beyond patents recognized as essential by courts, try asserted SEPs: they are likely more essential (and valid) than the usual crowd. Not available in high volume, but still useful.

Infringement. Check virtual markings in the relevant industry. Follow the interesting project by EPFL professors [2].

Value. Check your own historical transaction data points. Paid databases aggregating information from public records (e.g. SEC) are available.

Quality. No idea what it covers. Infringement, validity, essentiality, and monetary value can be decided by a legal proceeding, and therefore predicted. Quality cannot.