Data is a collection of facts or information, and historically it has been collected in many forms. Today, much of it is published as PDFs and related document formats. We may store data in tables, but when it is published it usually has to take the shape of a document. These documents can be difficult for organizations to interpret, and they often don’t allow for proper analysis of the data.
Recent technological advancements, however, have made it easier to assemble data so that it can be analyzed more effectively. Unfortunately, no single product on the market is an off-the-shelf solution for this problem. Instead, one has to combine several products and custom solutions to achieve the business objective: storing this data in a relational model so that proper analysis becomes possible and quick.
When selecting these products, one must keep the following challenges in mind in order to build the customized solution that best fits the need.
- Data Sourcing: This concerns the collection capabilities of the tool or solution being chosen. One must consider how easy it is to implement, how the tool collects data, and whether it can crawl authenticated sites as well as public ones or only public sites. Even if the tool is able to crawl, certain vendors block anonymous automated programs from entering their sites and crawling them. If the security on the vendor’s servers is stringent, requests will still be declined by the central authentication servers at the vendor’s site, even when the crawler presents correct user credentials. One must therefore ask whether an agreement is in place with vendors that have stringent authentication security models, so that the crawler’s requests are allowed through.
- What to crawl and what not to crawl: Deciding what to crawl and what to leave out is one of the biggest challenges in data collection. Business users have to draw a crystal-clear distinction between which data is needed and which types of documents it is available in. Because vendors may post any kind of document in their repositories, it is essential to identify which ones are needed.
- Data Extraction: Extracting data points from a document and building the associations between them is another of the most challenging areas. It is also where the organization gains accelerated business value once the data is analyzed. Most organizations struggle to make an informed decision in this area. There are no off-the-shelf products available. One product I have worked with is Apache PDFBox. It is able to read the PDF cell by cell, identifying the information that resides in each cell. After the information is extracted, a custom application needs to build a proper relational model that relates each extracted data point to its taxonomy structure. Mapping each data point to its context after extraction is a huge challenge. The data in a PDF may have a layout that looks fine to the naked eye, but only when the PDF is read with PDFBox does the true storage pattern become visible.
- Data Format during PDF creation: One has to keep in mind how the PDFs were created. This becomes challenging when different data sources have been used to create the PDF documents, and products and custom solutions may have to be enhanced depending on how a PDF was generated. Sometimes PDFs are generated for distribution from Excel sheets; other times they come from Word documents or table sources. Such diverse sources result in additional formatting cells appearing in the PDF. Sometimes the data is even in an image format.
- Building a rectification tool: Data points are extracted through a variety of processes, so data analysts need a tool with which to rectify the data manually. This is especially important because data that is extracted incorrectly or wrongly associated will not convey any meaning to business users. Extracting such a huge variety of data points and then pulling them together in one tool can pose a big performance challenge, since the relational model has to be mapped for each data point.
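
To make the last two ideas more concrete, here is a minimal sketch of how extracted data points might be stored against a taxonomy and later rectified by an analyst. It uses Python's built-in `sqlite3` module purely for illustration. Every table, column, and function name below is a hypothetical choice rather than part of any product mentioned above, and the extraction step itself (for example with PDFBox) is assumed to have already produced the raw values.

```python
import sqlite3

# Hypothetical schema: all table and column names are illustrative assumptions.
SCHEMA = """
CREATE TABLE taxonomy (
    id        INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    parent_id INTEGER REFERENCES taxonomy(id)   -- NULL for a root category
);
CREATE TABLE data_point (
    id              INTEGER PRIMARY KEY,
    taxonomy_id     INTEGER NOT NULL REFERENCES taxonomy(id),
    source_doc      TEXT NOT NULL,               -- document the value came from
    raw_value       TEXT NOT NULL,               -- value exactly as extracted
    corrected_value TEXT,                        -- filled in by an analyst
    reviewed        INTEGER NOT NULL DEFAULT 0   -- 0 = pending, 1 = rectified
);
"""


def create_store(path=":memory:"):
    """Open a database (in memory by default) and create the sketch schema."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_taxonomy(conn, name, parent_id=None):
    """Insert a taxonomy node and return its id."""
    cur = conn.execute(
        "INSERT INTO taxonomy (name, parent_id) VALUES (?, ?)", (name, parent_id)
    )
    return cur.lastrowid


def add_data_point(conn, taxonomy_id, source_doc, raw_value):
    """Store an extracted value under a taxonomy node and return its id."""
    cur = conn.execute(
        "INSERT INTO data_point (taxonomy_id, source_doc, raw_value) VALUES (?, ?, ?)",
        (taxonomy_id, source_doc, raw_value),
    )
    return cur.lastrowid


def rectify(conn, data_point_id, corrected_value):
    """Record an analyst's manual correction and mark the point as reviewed."""
    conn.execute(
        "UPDATE data_point SET corrected_value = ?, reviewed = 1 WHERE id = ?",
        (corrected_value, data_point_id),
    )


def pending_review(conn):
    """Return ids of data points that no analyst has rectified yet."""
    rows = conn.execute(
        "SELECT id FROM data_point WHERE reviewed = 0 ORDER BY id"
    ).fetchall()
    return [row[0] for row in rows]


def effective_value(conn, data_point_id):
    """Return the corrected value if present, otherwise the raw extracted one."""
    row = conn.execute(
        "SELECT COALESCE(corrected_value, raw_value) FROM data_point WHERE id = ?",
        (data_point_id,),
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    conn = create_store()
    financials = add_taxonomy(conn, "Financials")            # made-up category
    revenue = add_taxonomy(conn, "Revenue", parent_id=financials)
    point = add_data_point(conn, revenue, "report.pdf", "1,2OO")  # made-up misread
    rectify(conn, point, "1,200")
    print(effective_value(conn, point), pending_review(conn))
```

Keeping the raw value next to the corrected one is a design choice in this sketch: it lets analysts audit what the extractor originally produced. A real system would likely also need indexes and batching to address the performance concern raised above.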