Use Python to download TXT-format SEC filings on EDGAR (Part I)

We know that SEC makes company filings (e.g.,10-Ks, 10-Qs and 8-Ks) publicly available on EDGAR. The web search interface is convenient, but we may need to bulk download raw text filings. SEC provides an anonymous EDGAR FTP server to access raw text filings (Update on March 3, 2017: SEC closed the FTP server permanently on December 30, 2016. So the description about the FTP server might be not applicable thereafter. But the basic idea about the URLs to raw text filings remain unchanged). Usually, if we know the path or URL to a file on an FTP server, we can easily use an Internet browser or an FTP software to connect to the server and download the file. For example, if we navigate a bit on the EDGAR FTP server, we can find the path to the file “master.idx” as follows:

ftp://ftp.sec.gov/edgar/full-index/2015/QTR4/master.idx

Copy the path into an Internet browser or an FTP software, we can download the file directly.

In the above example, we can find the path to “master.idx” by navigating on the EDGAR FTP server. But we cannot find any path to any raw text filing. In other words, paths to raw text filings are not visible by simply looking into the EDGAR FTP server. SEC purposely hides paths to raw text filings to reduce server load and avoid data abuse.

In order to download SEC filings on EDGAR, we have to:

  1. Find paths to raw text filings;
  2. Select what we want and bulk download raw text filings from the EDGAR FTP server using paths we have obtained in the first step.

This post describes the first step, and I elaborate the second step in another post.

SEC stores all path information in index files. See technical details here. Let’s take a snap shot of an index file:

The last field on a line in the main body of the index file shows the path to a real raw text filing. What we have to do in the first step is to download and parse all index files and write the content into a database. Then in the second step, we can execute any query into the database (e.g., select certain form type or certain period of time) and download raw text filings using selected paths.

I write the following Python program to execute the first step. This program borrows from Edouard Swiac’s Python module “python-edgar” (version: 1.0). Please see his package information page here.

Please note: my program stores all paths in an SQLite database. I personally like the lightweight database product very much. The last few lines of my program transfer data from the SQLite database to an Stata dataset for users who are not familiar with SQLite. To do so, I use two Python modules: pandas and sqlalchemy which you have to install using pip command on your own. Please google documentations of SQLite, Pandas, and SQLAchemy if you have installation problems. I am using Python 3.x in all my Python posts.

Update on March 3, 2017: SEC closed the FTP server permanently on December 30, 2016 and started to use a more secure transmission protocol—https. Since then I have received several requests to update the script. Here it is the new script for Part I.

I find two articles explain how to use R and Perl to achieve the same functionality. I include the links (R or Perl) for users who are more comfortable with R or Perl. Also, the technical details may be too boring to most people. So I provide the Stata dataset for download (size: 3.6GB) which includes all index files from 1993 Q1 to March 3, 2017.

This entry was posted in Data, Python. Bookmark the permalink.

47 Responses to Use Python to download TXT-format SEC filings on EDGAR (Part I)

  1. Alexandra says:

    Hi Kai, I have found your page while checking for ways to download SEC EDGAR files. I am very new to this topic and was wondering if you could help me with some questions I have regarding the use of Python for such a massive download? I would be happy to send you an email, but please let me know if you prefer me to ask you as a comment related to your post instead. Thanks!

  2. Cliff says:

    Thanks for the data, but I have a hard time to extract this file (around 395.95MB). The upzipped file size seems to be about 1.7GB, rather than 19GB. And I also have an error when the extracting process is close to 100%. Any thoughts?

    P.S. I use 7-z to extract it.

  3. Cliff says:

    Hi Kai,

    Thank you! I just saw the reply. (I thought I would receive an email from you when I got a reply:p)
    I found another website also includes the Edgar masterfile (http://www.wrds.us/index.php/repository/view/25)
    When I compare that one with yours, basically the two are quite similar, almost the same amount of observations. One thing I notice is that yours have ID, and the length of your variables is 255. Therefore their size is 1.9 GB, yours is 19 GB. Good job anyway!

  4. Kal Alsabah says:

    Hi Kai,

    Thanks for sharing this post. I’m facing an error when I run it with Python 2.7 (using Windows) :

    AttributeError: ‘module’ object has no attribute ‘FTP_ADDR’

    I tried importing edgar.FTP_ADDR but i get an import error:

    ImportError: No module named FTP

    Any idea how to resolve it.

    Best,

    Kal

  5. Scott says:

    This is great, but when i run the code I get “sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.”

    Is there some way to covert this to utf-8?
    Thanks,

  6. svet says:

    are 13F filings included in this file? Thanks!

  7. grateful guy says:

    Wow, this was really helpful. It took me a while to figure out
    [Errno 2] No such file or directory

    I had to make sure that the directory existed.
    http://stackoverflow.com/questions/2401628/open-file-in-w-mode-ioerror-errno-2-no-such-file-or-directory

  8. John White says:

    Excellent work. We are a boutique Investment Bank based in Chicago and Dallas. Specializing in marketing private offerings publicly via the JOBS Act/SEC guidelines. http://www.cassonmediagroup.com

    One of the tools we are working on is providing free, user-friendly access to all Private Offering filed (Form D) plus all offerings that have been posted on the “Equity Crowdfunding” sites (that have not yet filed): a manual aggregation project.

    Our thinking at this point is to not only extract via FTP, etc. all the new Form D filings every day plus attached docs. But to build a “web widget” app that can be embedded on any website (similar to stock quote widgets) giving the user a fast and painless way to access/search private offerings.

    Also, we have a mobile app (iPad, iPhone, IOS, Android) that would access the data.
    At no point do plan to monetize access, all would be Open Source and free. We would ask name and email to access to use the app.

    We would deeply appreciate any assistance that you could provide. Paid consulting gig, a 30-minute phone conference, design review, anything!

    I look forward to hearing from you.

    Thanks in advance,

    John White
    847-867-5911

  9. Tayyaba says:

    Hi Kai Chen,
    Great post. I was wondering where I can find the part II of this post as I want only 10-k and 10-q forms. Secondly, I’m able to create the ‘idx’ table, but somehow it’s not getting inserted with the values. Is it because, I always end up with an error?

    Thanks in advance,
    Tayyaba

  10. Eva Lee says:

    hi Kai,

    Great job. May I know why there is a loop i from 0 to 9 in the following code?
    with zipfile.ZipFile(temp).open(‘master.idx’) as z:
    for i in range(10):
    z.readline()

    Thank you!

  11. Kwan says:

    Hello Kai,

    Thanks so much for your posting. It helps me a lot. I was thinking about doing the same thing but do not have enough skills to write a python scripts like this. I appreciate your effort.
    I emulated the script on Python 3.5.2 and was able to create a sqlite db file. But I get erros when I try to export it as dta file. So I just also downloaded the dta file but want to know what is wrong. Below is the error message.

    —————————————————————————
    OperationalError Traceback (most recent call last)
    C:\Anaconda3\lib\site-packages\sqlalchemy\engine\result.py in fetchall(self)
    963 try:
    –> 964 l = self.process_rows(self._fetchall_impl())
    965 self._soft_close()

    C:\Anaconda3\lib\site-packages\sqlalchemy\engine\result.py in _fetchall_impl(self)
    914 try:
    –> 915 return self.cursor.fetchall()
    916 except AttributeError:

    OperationalError:

    The above exception was the direct cause of the following exception:

    OperationalError Traceback (most recent call last)
    in ()
    5 engine = create_engine(‘sqlite:///edgar_idx.db’)
    6 with engine.connect() as conn, conn.begin():
    —-> 7 data = pandas.read_sql_table(‘idx’, conn)
    8 data.to_stata(‘edgar_idx.dta’)

    C:\Anaconda3\lib\site-packages\pandas\io\sql.py in read_sql_table(table_name, con, schema, index_col, coerce_float, parse_dates, columns, chunksize)
    362 table = pandas_sql.read_table(
    363 table_name, index_col=index_col, coerce_float=coerce_float,
    –> 364 parse_dates=parse_dates, columns=columns, chunksize=chunksize)
    365
    366 if table is not None:

    C:\Anaconda3\lib\site-packages\pandas\io\sql.py in read_table(self, table_name, index_col, coerce_float, parse_dates, columns, schema, chunksize)
    1128 return table.read(coerce_float=coerce_float,
    1129 parse_dates=parse_dates, columns=columns,
    -> 1130 chunksize=chunksize)
    1131
    1132 @staticmethod

    C:\Anaconda3\lib\site-packages\pandas\io\sql.py in read(self, coerce_float, parse_dates, columns, chunksize)
    809 parse_dates=parse_dates)
    810 else:
    –> 811 data = result.fetchall()
    812 self.frame = DataFrame.from_records(
    813 data, columns=column_names, coerce_float=coerce_float)

    C:\Anaconda3\lib\site-packages\sqlalchemy\engine\result.py in fetchall(self)
    968 self.connection._handle_dbapi_exception(
    969 e, None, None,
    –> 970 self.cursor, self.context)
    971
    972 def fetchmany(self, size=None):

    C:\Anaconda3\lib\site-packages\sqlalchemy\engine\base.py in _handle_dbapi_exception(self, e, statement, parameters, cursor, context)
    1339 util.raise_from_cause(
    1340 sqlalchemy_exception,
    -> 1341 exc_info
    1342 )
    1343 else:

    C:\Anaconda3\lib\site-packages\sqlalchemy\util\compat.py in raise_from_cause(exception, exc_info)
    200 exc_type, exc_value, exc_tb = exc_info
    201 cause = exc_value if exc_value is not exception else None
    –> 202 reraise(type(exception), exception, tb=exc_tb, cause=cause)
    203
    204 if py3k:

    C:\Anaconda3\lib\site-packages\sqlalchemy\util\compat.py in reraise(tp, value, tb, cause)
    183 value.__cause__ = cause
    184 if value.__traceback__ is not tb:
    –> 185 raise value.with_traceback(tb)
    186 raise value
    187

    C:\Anaconda3\lib\site-packages\sqlalchemy\engine\result.py in fetchall(self)
    962
    963 try:
    –> 964 l = self.process_rows(self._fetchall_impl())
    965 self._soft_close()
    966 return l

    C:\Anaconda3\lib\site-packages\sqlalchemy\engine\result.py in _fetchall_impl(self)
    913 def _fetchall_impl(self):
    914 try:
    –> 915 return self.cursor.fetchall()
    916 except AttributeError:
    917 return self._non_result([])

    OperationalError: (sqlite3.OperationalError)

    • Kai Chen says:

      Hi Kwan,

      I don’t know the exact cause based on the error log, but I guess it may be related to the relative and absolute path. Try this: first, check the file size of “edgar_idx.db” you got. If it is a sizeable file, the problem should exist in the conversion of Sqlite to Stata; second, use the absolute path in the conversion step: engine = create_engine(r’sqlite:///C:\path\to\edgar_idx.db’). Replace ‘C:\path\to\’ with your own absolute path.

      Let me know if this works.

      Kai

      • Kwan says:

        Hello Kai,

        I spent some time reverse engineering and revising your script to suit my need. And your suggestion worked!! Thanks so much!!

        • Kwan says:

          I was so excited when python was running but unfortunately, at the end, it gave an operationalError. would it be due to the file size, the first possibility you mentioned above. Could you explain how to deal with big db files?

          • Kai Chen says:

            I didn’t try my codes on a Windows machine. The first step is to make sure you have generated a Sqlite database correctly. If the database has gathered the data as it is supposed to have, the size of the database should be about 2G. The first step helps you to decide at which stage the error occurs. If you can get the correct database file but just cannot export the data to another software, you can instead Google other exporting methods (there are tons of other methods).

  12. Kwan says:

    Thanks for your reply, Kai. I have encountered errors a few times before running scripts written on Mac. It is more than a backslash or front slash issue but could not pinpoint the cause. I will figure it out and get back to you when I solve the issue. Thanks!

  13. agosta says:

    Thank you Kai for sharing this tutorial – after tweaking your code to better suit my environment it worked on the first try! I do, however, see something odd about the data returned. Although my sqlite db is about 2gb as you said it should be (it’s actually 1.7gb with another table in it), I find that I was only able to index ~16.4M filings when I ran the script today.

    However, this article written back in 2014 (http://tinyurl.com/jupr3zy), indicates that there should’ve been at least 25M filings. I’ve seen various other articles that note there should be at least 20M.

    Do you have any idea where the other ~5M to ~10M indexes are?

    Thanks again for sharing!

  14. erm3nda says:

    For people using that, ftp will shutdown dec ’16. It is publised on their site. https://www.sec.gov/edgar/searchedgar/ftpusers.htm

  15. David says:

    Does this return all the 10k for every company in the russell3000?

    Would like to know what does it extract. I had data base programming using SQLite to extracts tweets into database. I would like to know what does this do in specific?

  16. David says:

    So I get the name of the company, ticker and path. Then how can I use the path to iterate over the text file?

  17. Manuel says:

    UNfortunately, EDGAR access will be shutdown by the end of the year 🙁
    https://www.sec.gov/edgar/searchedgar/ftpusers.htm

  18. HDT says:

    Regarding the SEC shutting down the FTP server on 12/31:

    Replacing the FTP bits in the script with urllib2 and using https://www.sec.gov/Archives/edgar/full-index etc. as the base will return exactly the same data, and the rest of the script should work accordingly.

  19. Tim says:

    Hi HDT,

    I would also appreciate it if you could reply. Specifically if you could expand on what exactly entails replacing the FTP bits in the script with urllib2.

    Thank you!

  20. Xiang says:

    Thanks for your code! I am using your code to successfully download proxy statement in txt format. Since the htm file is better formatted, I try to revise your code to download crawler.idx. But I am stuck in the code:
    cur.executemany(‘INSERT INTO idx VALUES (?, ?, ?, ?, ?)’, records),
    as I don’t how to set line.split for crawler.idx. Can you help to revise the code? Thanks in advance!

    • Kai Chen says:

      Please substitute the “records = ” line with the following several lines:

      nameloc = lines[7].find(‘Company Name’)
      typeloc = lines[7].find(‘Form Type’)
      cikloc = lines[7].find(‘CIK’)
      dateloc = lines[7].find(‘Date Filed’)
      urlloc = lines[7].find(‘URL’)
      records = [tuple([line[:typeloc].strip(), line[typeloc:cikloc].strip(), line[cikloc:dateloc].strip(), line[dateloc:urlloc].strip(), line[urlloc:].strip()]) for line in lines[9:]]

      But note the URL you obtain is not the URL to an html version Form (see one example of such url https://www.sec.gov/Archives/edgar/data/859747/0001477932-16-007969-index.htm). You still have to go one level deeper, which I don’t know how to do yet.

  21. Yulin Chen says:

    Hi Kai

    That is really useful code that can be reused in many scenarios. I am trying to create a database which contains detail information from 13F-HR. Now i did it by getting all raw txt/xml data based on your code 🙂

    Now something frustrate(little bit..) me is … looks like there are all raw txt file for 13F until 2013 3q when xml available

    Although it s quite ease to parse xml by python elementTree, i still struggling in how to deal with those txt data before 2013 3q.

    Any constructive thoughts ?

    Thank you very much

    By the way, i am also in Ontario CA 🙂

    • Kai Chen says:

      My current knowledge about text data processing is also limited. To start with, you can learn regular expressions. That could resolve many questions. The high-level applications needs the knowledge of natural language processing using Python (or other programming languages). It’s technically intimidating.

  22. Irem says:

    Hi Ken, thanks for your post. I am just wondering whether you know if it is possible to automate an online download with Stata. This is the only program that I am comfortable using but never wrote such a code before.

Leave a Reply

Your email address will not be published. Required fields are marked *