Use Python to extract Intelligence Indexing fields in Factiva articles

First of all, I should acknowledge that I benefited a lot from Neal Caren’s blog post Cleaning up LexisNexis Files. Thanks, Neal.

Factiva (as well as LexisNexis Academic) is a comprehensive repository of newspapers, magazines, and other news articles. I first describe the data elements of a Factiva news article. Then I explain the steps to extract those data elements and write them into a more machine-readable table using Python.

Data Elements in Factiva Article

Each news article in Factiva, no matter what it looks like, contains a number of data elements. In Factiva’s terminology, those data elements are called Intelligence Indexing Fields. The following table lists the label and name of each data element (or field), along with what it contains:

| Field Label | Field Name | What It Contains |
|---|---|---|
| HD | Headline | Headline |
| CR | Credit Information | Credit Information (Example: Associated Press) |
| WC | Word Count | Number of words in document |
| PD | Publication Date | Publication Date |
| ET | Publication Time | Publication Time |
| SN | Source Name | Source Name |
| SC | Source Code | Source Code |
| ED | Edition | Edition of publication (Example: Final) |
| PG | Page | Page on which article appeared (Note: Page-One Story is a Dow Jones Intelligent Indexing™ term) |
| LA | Language | Language in which the document is written |
| CY | Copyright | Copyright |
| LP | Lead Paragraph | First two paragraphs of an article |
| TD | Text | Text following the lead paragraphs |
| CT | Contact | Contact name to obtain additional information |
| RF | Reference | Notes associated with a document |
| CO | Dow Jones Ticker Symbol | Dow Jones Ticker Symbol |
| IN | Industry Code | Dow Jones Intelligent Indexing™ Industry Code |
| NS | Subject Code | Dow Jones Intelligent Indexing™ Subject Code |
| RE | Region Code | Dow Jones Intelligent Indexing™ Region Code |
| IPC | Information Provider Code | Information Provider Code |
| IPD | Information Provider Descriptors | Information Provider Descriptors |
| PUB | Publisher Name | Publisher of information |
| AN | Accession Number | Unique Factiva.com identification number assigned to each document |

Please note that not every news article contains all of those data elements, and that the table may not list every data element used by Factiva (Factiva may make updates). Depending on which display option you select when downloading news articles from Factiva, you may not see certain data elements. But they are there, and Factiva uses them to organize and structure its proprietary news article data.

How to Extract Data Elements in Factiva Article

[Figure: flow diagram of the three steps]

You can follow the three steps outlined in the above diagram to extract the data elements in news articles for further processing (e.g., calculating the tone of the full text, represented by the LP and TD elements together, or grouping articles by news subject, i.e., by the NS element). I explain the steps one by one below.
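To make the "further processing" idea concrete, here is a minimal sketch of what it could look like once each article is available as a Python dictionary keyed by field label. The function names, the sample values, and the assumption that the NS field holds entries of the form "code : description" separated by "|" are all illustrative; check them against your own export before relying on them.

```python
from collections import Counter


def full_text(article):
    """Join the lead paragraph (LP) and the remaining text (TD)."""
    parts = [article.get("LP", ""), article.get("TD", "")]
    return "\n".join(part for part in parts if part)


def subject_counts(articles, sep="|"):
    """Count how many articles carry each subject code in NS.

    ASSUMPTION: NS looks like "code : description | code : description".
    """
    counts = Counter()
    for article in articles:
        for item in article.get("NS", "").split(sep):
            code = item.split(":")[0].strip()
            if code:
                counts[code] += 1
    return counts


# Made-up sample articles, for illustration only.
articles = [
    {"LP": "Lead.", "TD": "Body.", "NS": "c151 : Earnings | ccat : Corporate News"},
    {"LP": "Only a lead.", "NS": "ccat : Corporate News"},
]
```

With these samples, full_text(articles[0]) joins the two parts with a newline, and subject_counts(articles) counts "ccat" twice and "c151" once.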

Step 1: Download Articles from Factiva in RTF Format

Downloading a large number of news articles from Factiva is painful: it is technically difficult to download articles in an automated fashion, and you can only download 100 articles at a time, with a combined word count of no more than 180,000. As a result, gathering tens of thousands of news articles requires a lot of tedious work. While I can do nothing about either issue in this post, I can say a bit more about them.

Firstly, you may see some people discuss methods for automatic downloading (a so-called “web scraping” technique; see here). However, this needs more hacking now that Factiva has introduced CAPTCHA to determine whether or not the user is a human. You may not be familiar with the term “CAPTCHA”, but you have surely been asked to type in the characters or numbers shown in an image before you could download a file or go to the next webpage. That is CAPTCHA. Both Factiva and LexisNexis Academic have introduced CAPTCHA to block robotic downloading. Though CAPTCHA is not unbeatable, beating it requires advanced techniques.

Secondly, the Factiva licence expressly prohibits data mining. However, the licence does not clearly define what constitutes data mining. I was informed that downloading a large number of articles in a short period of time would be red-flagged as data mining. But the threshold speed set by Factiva is low, and any trained and adept person can easily exceed it. If you are red-flagged by Factiva, things could get ugly. So do not go too fast, even if this slows down your research.

Let’s get back to the topic. When you manually download news articles from Factiva, the most important thing is to select the right display option. Please select the third one, Full Article/Report plus Indexing, as indicated in the following screenshot:

[Screenshot: Factiva display options, with Full Article/Report plus Indexing selected]

Then you have to download the articles in RTF – Article Format, as indicated in the following screenshot:

[Screenshot: Factiva download options, with RTF – Article Format selected]

After the download is completed, you will get an RTF document. If you open it, you will find that the news articles look like this:

[Screenshot: a news article in the downloaded RTF document]

The next step is to convert RTF to plain TXT, because Python can process TXT documents more easily. After Python finishes its job, the final product will be a table: each row of the table represents a news article; and each column of the table is a data element.

Step 2: Convert RTF to TXT

Well, this can surely be done in Python, but so far I have not written a Python program to do it. I will fill this “hole” when I have time. For my research, I simply take advantage of TextEdit, the default text editor shipped with Mac OS. I select Format – Make Plain Text from the menu bar, and then save the document in TXT format. You can automate this using Automator in Mac OS.
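One possible way to script this conversion from Python, still on a Mac, is to call the textutil command-line tool that ships with Mac OS. The sketch below is only an illustration under that assumption: the function names are made up for this example, and the output may not match TextEdit's Make Plain Text exactly, so compare a converted file by eye first.

```python
import shutil
import subprocess
from pathlib import Path


def textutil_command(rtf_path):
    """Build the command that converts one RTF file to plain text.

    textutil writes the result next to the input, with a .txt extension.
    """
    return ["textutil", "-convert", "txt", str(rtf_path)]


def convert_all(directory="."):
    """Convert every .rtf file in `directory`; return the converted paths."""
    if shutil.which("textutil") is None:
        raise RuntimeError("textutil not found; it is only available on Mac OS")
    converted = []
    for rtf_path in sorted(Path(directory).glob("*.rtf")):
        subprocess.run(textutil_command(rtf_path), check=True)
        converted.append(rtf_path)
    return converted
```

Calling convert_all() with the folder that holds the downloaded RTF documents should leave one TXT file beside each of them.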

Step 3: Extract Data Elements and Save to a Table

This is where Python does the dirty work. To run the Python program correctly, please save it in the directory that holds all the plain TXT documents created in Step 2. The program will:

  1. Read in each TXT document;
  2. Extract data elements of each article and write them to an SQLite database;
  3. Export data to a CSV file for easy processing in other software such as Stata.
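The reading and extracting steps can be sketched roughly as follows. This is a minimal illustration and not the program this post refers to. It assumes a particular layout for the converted TXT files: each field label sits alone on its own line, the field's value follows on the next lines, and every article ends with a line of the form "Document" plus the accession number. All function and variable names are made up for this example; adjust the patterns to what your own files actually look like.

```python
import re

# Field labels from the table in this post.
LABELS = {"HD", "CR", "WC", "PD", "ET", "SN", "SC", "ED", "PG", "LA", "CY",
          "LP", "TD", "CT", "RF", "CO", "IN", "NS", "RE", "IPC", "IPD",
          "PUB", "AN"}

# ASSUMPTION: each article ends with a line "Document <accession number>".
END_OF_ARTICLE = re.compile(r"Document (\S+)")


def parse_articles(text):
    """Return one dict per article, mapping field label to field value."""
    articles, fields, label = [], {}, None
    for line in text.splitlines():
        stripped = line.strip()
        end = END_OF_ARTICLE.fullmatch(stripped)
        if end:
            if fields:
                # Use the closing line as AN if no AN value was captured.
                if not fields.get("AN"):
                    fields["AN"] = [end.group(1)]
                articles.append({k: "\n".join(v) for k, v in fields.items()})
            fields, label = {}, None
        elif stripped in LABELS:
            # ASSUMPTION: a label sits alone on its line.
            label = stripped
            fields.setdefault(label, [])
        elif label is not None and stripped:
            fields[label].append(stripped)
    return articles


def parse_file(path):
    """Read one TXT document, stating the encoding explicitly."""
    with open(path, encoding="utf-8") as handle:
        return parse_articles(handle.read())


# Made-up sample in the assumed layout.
sample = (
    "HD\nFirst headline\nWC\n120 words\nLP\nLead one.\nLead two.\n"
    "TD\nBody.\nAN\nDocument EXAMPLE0001\n\nDocument EXAMPLE0001\n"
    "HD\nSecond headline\nDocument EXAMPLE0002\n"
)
parsed = parse_articles(sample)
```

A body line that happens to consist of nothing but a field label (for example "IN") would be mistaken for a new field, so a stricter pattern may be needed for real data. Passing encoding="utf-8" to open() avoids depending on the system default encoding, which can be ASCII on some setups.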

I introduce an intermediate step that writes data to an SQLite database simply because this makes it easier to manipulate the news article data in Python for other purposes. Of course, you can write the data directly to a CSV file.
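A minimal sketch of the SQLite and CSV steps, using only Python's standard library, might look like this. It assumes the articles are already available as dictionaries keyed by field label. The table name, the choice of columns, and the function names are illustrative assumptions, not the original program's.

```python
import csv
import sqlite3

# Illustrative subset of field labels to store; extend as needed.
COLUMNS = ["AN", "HD", "PD", "SN", "WC", "LP", "TD", "NS"]


def save_to_sqlite(articles, db_path):
    """Write article dicts to an `articles` table, keyed by AN."""
    # Column names are quoted because some labels (e.g. IN) are SQL keywords.
    column_defs = ", ".join(f'"{c}" TEXT' for c in COLUMNS)
    placeholders = ", ".join("?" for _ in COLUMNS)
    rows = [tuple(a.get(c, "") for c in COLUMNS) for a in articles]
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                f'CREATE TABLE IF NOT EXISTS articles '
                f'({column_defs}, PRIMARY KEY ("AN"))'
            )
            # A repeated accession number overwrites the earlier row.
            conn.executemany(
                f"INSERT OR REPLACE INTO articles VALUES ({placeholders})",
                rows,
            )
    finally:
        conn.close()


def export_to_csv(db_path, csv_path):
    """Dump the `articles` table to a CSV file with a header row."""
    conn = sqlite3.connect(db_path)
    try:
        cursor = conn.execute('SELECT * FROM articles ORDER BY "AN"')
        header = [description[0] for description in cursor.description]
        with open(csv_path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            writer.writerows(cursor)
    finally:
        conn.close()
```

For example, save_to_sqlite(articles, "factiva.db") followed by export_to_csv("factiva.db", "factiva.csv") would produce a CSV file that can be opened in other software; both file names are placeholders.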


5 Responses to Use Python to extract Intelligence Indexing fields in Factiva articles

  1. Nguyen says:

    Hi there,
    I am using your method to extract information from Factiva.
    However, the code has some problem that I cannot run it smoothly.
    I have the following error:
    UnicodeEncodeError: 'ascii' codec can't encode character '\xa9' in position 165: ordinal not in range(128)
    Could you please help me to solve it?

  2. Anna says:

    Hi Kai,

    I also have problems extracting information from Factiva with your code.
    I have the following error:

    return codecs.ascii_decode(input, self.errors)[0]
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 22: ordinal not in range(128)

    It seems there is also a problem with “parser(f)”.
    Could you please help me to solve it?
    Thank you!

  3. Joel Nothman says:

    With the HTML export rather than RTF, you can get a great representation of the data with this one-liner!

    import pandas as pd
    data = pd.concat([art for art in pd.read_html('/path/to/factiva-export.html', index_col=0) if 'HD' in art.index.values], axis=1).T.set_index('AN')

    You are welcome to then use data.to_sql() or data.to_csv()…
