Noise Filtering

Abstract

Analyzing the patents in a technology domain is harder than it seems. The standard method is to formulate multiple search queries, extract the resulting patent data sets, and then filter the patents manually. Filtering the result set to remove noise is critical to an accurate analysis, but it requires a lot of manual effort and consumes a significant amount of time. With advancements in NLP and machine learning, much of this manual filtering can be automated. XLSCOUT has developed a noise filtering algorithm that removes noise from result sets by learning from previous data.

Introduction

Patents hold an abundance of information about the advancements in a technology domain and about a company’s strategy. Business strategies are decided based on this information, so gathering and extracting the relevant information is very important. Because patents are the source of this information, extracting the correct set of patents for the technology is of paramount importance.

Problem

Manually reading patents and extracting the ones relevant to a technology domain is time-consuming and requires a lot of grunt work. Different researchers reading the same patents can understand the technology concepts differently, which leads to inconsistent, noisy output.

Solution: XLSCOUT Noise Filtering Algorithm

At XLSCOUT, we have experience in patent research, and we have combined that with our technical expertise in NLP and machine learning to develop a noise filtering algorithm that can learn from previous data and remove noise from new data sets.

With this algorithm, we can train a model for a technology domain using previous (historic) data sets of relevant and noise patents in that domain. The algorithm learns from this data; once trained, it can be used to predict the relevant patents in future data sets from the same technology domain and to reduce the noise in those data sets to a minimum.

Technology

To develop the noise filtering algorithm, we used BERT as the NLP model. We fine-tuned the standard BERT model on patent data so that it can understand the concepts and semantics in patents. The fine-tuned BERT model is then used to transform patent text into a vector representation that a machine learning model can process.
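To make "vector representation" concrete, here is a minimal sketch of turning text into a fixed-length vector and comparing two vectors. It deliberately does not use BERT: a hashed bag-of-words encoder stands in for the fine-tuned model, which is not reproduced here. The names `embed`, `cosine`, and `DIM` are hypothetical and chosen only for illustration.

```python
import math
import re
import zlib

DIM = 64  # illustrative size only; BERT-base produces 768-dimensional vectors


def embed(text: str, dim: int = DIM) -> list[float]:
    """Stand-in encoder (NOT BERT): hashed bag-of-words, scaled to unit length."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # crc32 is used instead of hash() so results are stable across runs
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; a plain dot product because embed() returns unit vectors."""
    return sum(x * y for x, y in zip(a, b))


doc_a = embed("Lithium battery electrode coating")
doc_b = embed("Coating method for a lithium battery anode")
similarity = cosine(doc_a, doc_b)  # positive, because the texts share tokens
```

A real encoder such as BERT captures meaning beyond shared words, which this stand-in cannot do; the point here is only the shape of the step: text in, fixed-length numeric vector out.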

The second part of developing the algorithm is training a machine learning model. The model is trained on a labeled data set for a domain, that is, a set of patent documents each labeled as relevant or not-relevant (noise) for that domain. The model creates associations between the patent documents in the relevant set and those in the not-relevant set. By learning and identifying the important concepts in each set, it can then distinguish relevant patents from not-relevant ones.
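The text does not say which machine learning model is used, so the following is only a sketch under an assumption: a logistic regression classifier, trained with plain gradient descent, on toy two-dimensional vectors that stand in for patent embeddings. The function names and the data are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def relevance_score(weights, bias, x):
    """Estimated probability that vector x belongs to a relevant patent."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train(vectors, labels, epochs=500, lr=0.5):
    """Fit logistic regression by batch gradient descent (1 = relevant, 0 = noise)."""
    dim, n = len(vectors[0]), len(vectors)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(vectors, labels):
            err = relevance_score(weights, bias, x) - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Toy vectors (made up): relevant patents cluster near (1, 0), noise near (0, 1).
train_vectors = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.1],
                 [0.1, 0.9], [0.2, 0.8], [0.1, 0.7]]
train_labels = [1, 1, 1, 0, 0, 0]

weights, bias = train(train_vectors, train_labels)
score_relevant = relevance_score(weights, bias, [0.85, 0.15])  # above 0.5
score_noise = relevance_score(weights, bias, [0.15, 0.85])     # below 0.5
```

In practice the input vectors would come from the patent encoder and have hundreds of dimensions, and a library classifier would normally replace the hand-written training loop.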

Approach

The setup of the noise filtering algorithm has the following steps:

Data Collection

A labeled data set for the technology domain is curated.

Data Structuring

The data set is split into two parts: training data and test data (usually in the ratio of 80:20). The labels are withheld from the test data so that the model's predictions can be validated against them.
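The split described above can be sketched as follows. This is a simple shuffled split, assuming each record is a `(patent_id, label)` pair; the ids, the `split_dataset` helper, and the fixed seed are hypothetical.

```python
import random


def split_dataset(records, test_fraction=0.2, seed=42):
    """Shuffle labeled records and split them into (training, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Made-up records: (patent_id, label)
records = [(f"PAT-{i:03d}", "relevant" if i % 2 == 0 else "noise")
           for i in range(10)]

train_set, test_set = split_dataset(records)

# The model only sees the test ids; the labels are set aside for validation.
test_inputs = [patent_id for patent_id, _ in test_set]
held_out_labels = {patent_id: label for patent_id, label in test_set}
```

When one class is much rarer than the other, a stratified split that preserves the relevant-to-noise ratio in both parts may be a better choice than this plain shuffle.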

Training of Machine Learning model

The labeled training data is first transformed into a vector representation and then fed to the machine learning model for training.

Output Validation and Feedback

Once the model is trained, it is used to predict the relevant patents in the test data. False predictions are fed back to the model, allowing it to learn and improve its understanding. Multiple iterations are run to ensure that the output is correct and that the model captures all the relevant patents in the data set.
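How the false predictions are fed back is not specified, so the sketch below assumes one common option: each wrongly predicted patent is added to the training set with its correct label before the next training iteration. The helper names and patent ids are hypothetical.

```python
def find_false_predictions(predicted, actual):
    """Compare predictions with the held-out labels (True = relevant).

    Returns (false_positives, false_negatives) as sorted lists of patent ids.
    """
    false_pos = sorted(p for p, is_rel in predicted.items()
                       if is_rel and not actual[p])
    false_neg = sorted(p for p, is_rel in predicted.items()
                       if not is_rel and actual[p])
    return false_pos, false_neg


def add_feedback(training_set, predicted, actual):
    """Return a new training set extended with the corrected false predictions."""
    false_pos, false_neg = find_false_predictions(predicted, actual)
    corrections = [(p, actual[p]) for p in false_pos + false_neg]
    return training_set + corrections


# Made-up ids and labels for illustration only.
actual = {"PAT-1": True, "PAT-2": False, "PAT-3": True, "PAT-4": False}
predicted = {"PAT-1": True, "PAT-2": True, "PAT-3": False, "PAT-4": False}

false_pos, false_neg = find_false_predictions(predicted, actual)
next_training_set = add_feedback([("PAT-0", True)], predicted, actual)
```

Here `PAT-2` is noise that was wrongly kept and `PAT-3` is a relevant patent that was wrongly dropped. Once test examples have been used for feedback, a fresh held-out set is needed to measure the retrained model fairly.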

The most important parameter to consider is the accuracy of the algorithm. 100% accuracy would mean that all noise patents are removed and only relevant patents are kept. No algorithm can be completely accurate. However, by using our approach, we can ensure that no relevant patents are excluded from the final predicted data set. At the same time, some noise patents, though significantly fewer than before, will remain in the final data set. We have set up this algorithm for multiple clients, and they have validated that the noise is reduced by 85–90%.
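One way to read the trade-off above is as a choice of score threshold: set it low enough to keep every known relevant patent, then measure how much noise is still removed. The sketch below assumes that "noise reduction" means the fraction of noise patents removed, and uses invented scores and labels; `pick_threshold` and `evaluate` are hypothetical helpers.

```python
def pick_threshold(scores, labels):
    """Highest threshold that still keeps every known relevant patent."""
    return min(s for s, y in zip(scores, labels) if y == 1)


def evaluate(scores, labels, threshold):
    """Recall, precision, and noise reduction when keeping scores >= threshold."""
    kept = [y for s, y in zip(scores, labels) if s >= threshold]
    relevant_total = sum(labels)
    noise_total = len(labels) - relevant_total
    relevant_kept = sum(kept)
    noise_kept = len(kept) - relevant_kept
    return {
        "recall": relevant_kept / relevant_total,         # relevant patents retained
        "precision": relevant_kept / len(kept),           # share of kept set that is relevant
        "noise_reduction": 1 - noise_kept / noise_total,  # noise patents removed
    }


# Invented validation data: model scores and true labels (1 = relevant, 0 = noise).
scores = [0.95, 0.90, 0.72, 0.60, 0.65, 0.40, 0.30, 0.20, 0.15, 0.10, 0.05, 0.02]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

threshold = pick_threshold(scores, labels)
metrics = evaluate(scores, labels, threshold)
# recall 1.0, precision 0.8, noise_reduction 0.875 on this toy data
```

On this toy data the threshold is 0.60: all four relevant patents are kept, one of the eight noise patents slips through, and the other seven are removed. A threshold chosen this way only guarantees full recall on the data it was chosen from.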

Use Cases

Precise Technology Tracking
The algorithm helps extract the relevant patents needed to understand a technology domain precisely.
Precise Competitor Tracking
Competitors' patents can be extracted and precisely segmented by technology sector to remove noise and gain accurate insights into competitor strategies.
Accurate Landscape Insights
Extracting the relevant results and reducing noise also yields more accurate insights from landscape searches.

Conclusion

Purely manual analysis is becoming a thing of the past, and developing trustworthy applications using NLP and machine learning is the need of the hour. Automated solutions can assist manual research teams and make their work much easier.

With our experience in this field and the feedback that we have received, we have seen that the XLSCOUT noise filtering algorithm saves a lot of time and manual effort. The saved time can then be utilized to innovate and improve the technology.

Why stay behind? Get in touch with us!
