Use Python to download SEC filings on EDGAR (Part II)

As I said in the post entitled “Part I”, we have to complete two steps to download SEC filings on EDGAR:

  1. Find paths to raw text filings;
  2. Select what we want and bulk download the raw text filings from the EDGAR FTP server, using the paths obtained in the first step.

“Part I” elaborates on the first step. This post shares the Python code for the second step.

In the first step, I save the index files in a SQLite database as well as a Stata dataset. The index database includes all types of filings (e.g., 10-K and 10-Q). Select from the database the types you want and export your selection into a CSV file, say “sample.csv”. To use the following Python code, the CSV file must be formatted like this (this example selects all 10-Ks of Apple Inc):
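The sample file itself is not reproduced here, but based on the column layout the download code expects (cik, company name, form type, filing date, path), a sketch of the format can be generated like this. Note the accession numbers in the paths are invented for illustration; real ones come from the index files built in Part I.

```python
import csv

# Illustrative rows only. The layout is: cik, company name, form type,
# filing date, path. Apple's real CIK is 320193; the accession numbers
# in the paths below are made up.
rows = [
    ['320193', 'APPLE INC', '10-K', '2015-10-28',
     'edgar/data/320193/0000320193-15-000001.txt'],
    ['320193', 'APPLE INC', '10-K', '2016-10-26',
     'edgar/data/320193/0000320193-16-000001.txt'],
]

with open('sample.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)
```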

Then we can let Python complete the bulk download task:
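The original script was embedded here; a minimal reconstruction of the FTP-based loop, pieced together from the code lines quoted in this post and in the comments (the helper names are mine, and the FTP server itself was retired at the end of 2016, per the update below):

```python
import csv
from ftplib import FTP

def name_output(line):
    """Name the output file cik-formtype-filingdate, as described in the post."""
    return '-'.join([line[0], line[2], line[3]])

def download_all(csv_path='sample.csv'):
    # ftp.sec.gov was shut down on December 30, 2016; kept for reference only.
    ftp = FTP('ftp.sec.gov')
    ftp.login()
    with open(csv_path, newline='') as csvfile:
        for line in csv.reader(csvfile, delimiter=','):
            path = line[4].strip()
            with open(name_output(line), 'wb') as f:
                ftp.retrbinary('RETR %s' % path, f.write)
            print(path, 'downloaded and wrote to text file')
    ftp.quit()
```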

The code does not handle the file directories for “sample.csv” or for the output raw text filings; modify the paths as needed. saveas = '-'.join([line[0], line[2], line[3]]) names the output SEC filings; the current name is cik-form type-filing date.txt. Rearrange these elements to suit your needs (thanks to Eva for letting me know about a previous error here).

Update on March 3, 2017: The SEC closed the FTP server permanently on December 30, 2016 and switched to a more secure transmission protocol, HTTPS. Since then I have received several requests to update the script. Here is the new code for Part II:
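The new script was embedded here; a rough reconstruction based on the fragments quoted in the comments below (the helper names are mine) looks like this:

```python
import csv
import requests

def name_output(line):
    """Name the output file cik-formtype-filingdate."""
    return '-'.join([line[0], line[2], line[3]])

def build_url(line):
    """Turn the path column from the index into a full HTTPS URL."""
    return 'https://www.sec.gov/Archives/' + line[4].strip()

def download_all(csv_path='sample.csv'):
    with open(csv_path, newline='') as csvfile:
        for line in csv.reader(csvfile, delimiter=','):
            url = build_url(line)
            with open(name_output(line), 'wb') as f:
                f.write(requests.get(url).content)
            print(url, 'downloaded and wrote to text file')
```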

 

This entry was posted in Data, Python.

32 Responses to Use Python to download SEC filings on EDGAR (Part II)

  1. Kostas says:

    Really enjoyed the posts relating to EDGAR.

    Would be really nice if you could do another part focusing on how to clean and prepare SEC filings for textual analysis, or even how to extract particular items from 10-Ks, for example.

  2. Frank V says:

    Awesome Post!

    Is there a quick way to modify the Python script so that it writes the path to the XML infotable for each company? For instance, instead of outputting the following path to the db:

    “edgar/data/949623/0000949623-16-000014.txt”

    The script would output the following:

    “edgar/data/949623/000094962316000014/infotable.xml”

    Thanks again!

  3. David says:

    Do I have to use Stata to compile my CSV file, or can I use R, SAS, or even Python for that step? Can anyone help, please? I already have the paths.

  4. David says:

    Is it possible to compile only one section of each report, for example Item 7A?

  5. CL says:

    Thank you for sharing this knowledge!

    I executed Part 1 without problem, but when running Part 2 I ran into the error message “No such file or directory: …”. I’m not certain how to troubleshoot this. If I just use the code as is, should I be expecting any sort of problem?

  6. Thomas says:

    Excellent post! Thank you for sharing this information, it truly is invaluable.

    • William L says:

      Hi Kai,

      thank you very much for sharing this code. Unfortunately, the Part 2 code doesn’t work for my Python 2.7.13:

      I get the following error message:
      File "E:\William\Download_Edgar.py", line 10, in <module>
      with open('edgar_idx_sort_final.csv', newline='') as csvfile:
      TypeError: 'newline' is an invalid keyword argument for this function

      If I successfully circumvent the above error, I get the following error messages:
      Traceback (most recent call last):
      File "E:\William\Download_Edgar.py", line 18, in <module>
      ftp.retrbinary('RETR %s' % path, f.write)
      File "E:\William\Python\lib\ftplib.py", line 414, in retrbinary
      conn = self.transfercmd(cmd, rest)
      File "E:\William\Python\lib\ftplib.py", line 376, in transfercmd
      return self.ntransfercmd(cmd, rest)[0]
      File "E:\William\Python\lib\ftplib.py", line 339, in ntransfercmd
      resp = self.sendcmd(cmd)
      File "E:\William\Python\lib\ftplib.py", line 249, in sendcmd
      return self.getresp()
      File "E:\William\Python\lib\ftplib.py", line 224, in getresp
      raise error_perm, resp
      error_perm: 550 path: No such file or directory

      Would be great if you can give me a hint for the second error messages. Thank you and happy new year!

      Best
      William

  7. Greg says:

    Do you know how this works with the SEC turning off the FTP services? Thanks so much for the help!

  8. Liyan says:

    Hi Kai, thanks so much for sharing this great code. I am a Python newbie and am trying to modify your code to suit the new SEC website after the FTP server shutdown. I have been trying to replace ftplib with either httplib2 or urllib2, but it is not exactly working out; specifically, I am having difficulties saving the output as a file. Would you be kind enough to help out? Thank you!

    from __future__ import print_function

    import csv
    import httplib2

    h = httplib2.Http('www.sec.gov')

    with open('sample.csv', newline='') as csvfile:
        reader = csv.reader(csvfile, delimiter=',')
        for line in reader:
            saveas = '-'.join([line[0], line[1], line[2], line[3]])
            # Reorganize to rename the output filename.
            path = line[4].strip()
            with open(saveas, 'wb') as f:

  9. Claire says:

    Thanks for the code! Is there anything wrong with 2011 Q4? I get the files for every quarter except that one.

  10. Eva says:

    Hi Chen,
    I use your Python 3 code to download SEC 10-K filings. However, the following error showed up:
    ---------------------------------------------------------------------------
    FileNotFoundError Traceback (most recent call last)
    in ()
    8 # Reorganize to rename the output filename.
    9 url = 'https://www.sec.gov/Archives/' + line[4].strip()
    ---> 10 with open(saveas, 'wb') as f:
    11 f.write(requests.get('%s' % url).content)
    12 print(url, 'downloaded and wrote to text file')

    FileNotFoundError: [Errno 2] No such file or directory: '147-101320-UJB FINANCIAL CORP /NJ/-10-K'

    This is after successful download of a couple of txt files. Is it possibly because the url has changed or updated?

    • Kai Chen says:

      Thanks for letting me know about the bug. This happens because the company name contains special characters that are not allowed in a text file name. In the code line saveas = '-'.join([line[0], line[1], line[2], line[3]]), line[0] represents the first element in one line/record of the CSV file (in your case, 147); line[1] represents the second element (in your case, 101320), and so on. To solve the problem, remove the company-name element (in your case, line[2]). I would suggest changing the code line to: saveas = '-'.join([line[0], line[1], line[3], line[4]]).

  11. João Lago says:

    Dear Kai,

    First, thank you very much for your code. I have been trying to use your updated script with Python 3. However, I get the following error when I run the line "import request":
    "ImportError: No module named request"
    I tried installing the package but I am having some difficulties with it. Would you know how to solve this issue for Python 3?

    In terms of the information that I need to extract, I just need the date of the Filing, the CUSIP of the targeted firm and the Percentage of ownership recorded (if this helps on solving the issue).

    Thank you very much for your work.

    Best regards,

    João Lago

    • Kai Chen says:

      The Requests module documentation or Google would be a better resource for troubleshooting. If you want to extract specific information from text filings, my current posts cannot help you with that. To achieve it, you should learn something called “regular expressions”. Data collection is a huge cost for research. If you only need CUSIP and percentage of ownership (what ownership?), are you sure extracting them from raw text filings is the best way to go?

  12. Christian Ramos says:

    Dear Kai,

    I had a similar problem to Eva, but I couldn’t fix it myself. Below is a summary of the
    code and the error:

    import csv
    import requests

    with open('sample.csv', newline='') as csvfile:
        reader = csv.reader(csvfile, delimiter=',')
        for line in reader:
            saveas = '-'.join([line[0], line[2], line[3]])
            url = 'https://www.sec.gov/Archives/' + line[4].strip()
            with open(saveas, 'wb') as f:
                f.write(requests.get('%s' % url).content)
                print(url, 'downloaded and wrote to text file')

    5199
    https://www.sec.gov/Archives/edgar/data/1000045/0001436857-15-000014.txt downloaded and wrote to text file
    Traceback (most recent call last):
    File "", line 6, in <module>
    with open(saveas, 'wb') as f:
    FileNotFoundError: [Errno 2] No such file or directory: '1000045-SC 13D/A-2015-04-24'

    As you can see, the first filing was extracted, but the next one wasn’t. How can I solve this?

    Thank you very much.

    Kind regards,

    Christian Ramos

    • Kai Chen says:

      Hi Christian, this is because the forward slash is not allowed in the file name. I have updated my script and made it more error-proof. Please use the new script.
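      The sanitizing step can be sketched like this (the exact character set in the updated script may differ; this is just one common way to strip characters that file systems reject):

```python
import re

def sanitize(name):
    """Remove characters that are not allowed in file names on
    Windows/macOS, such as the forward slash in 'SC 13D/A'."""
    return re.sub(r'[\\/:*?"<>|]', '', name)

print(sanitize('1000045-SC 13D/A-2015-04-24'))  # -> 1000045-SC 13DA-2015-04-24
```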

      • Christian Ramos says:

        Dear Kai,

        Thank you for your reply. The script works perfectly fine!

        Do you have any advice on how to compile this data into a workable Excel file?

        Thank you very much for your work.

        Kind regards,

        Christian Ramos

        • Kai Chen says:

          I am not sure what you want to do exactly. If you want to extract specific texts from those filings, that’s not something my current posts can help with. But to do that, you can turn to something called “regular expressions”. Regular expressions are independent of programming language. There is a misconception that only Perl can do text extraction. That’s dead wrong. In my opinion, people will forget Perl once they appreciate the simplicity and readability of Python. Many programming languages, including Python, support regular expressions pretty well nowadays. In accounting and finance research, many textual analysis tutorials use Perl not because Perl is way better, but simply because Perl is older. That being said, writing good regular expression patterns is a work of art. I would suggest borrowing regular expression patterns from those Perl tutorials and then asking Python to do the rest.
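          As a tiny illustration of the idea (the header fields below do appear in EDGAR raw text filings; the pattern itself is just an example):

```python
import re

# A fragment of an EDGAR filing header (illustrative values).
header = """CONFORMED PERIOD OF REPORT: 20151231
FILED AS OF DATE: 20160226"""

# Capture the eight-digit period-of-report date.
match = re.search(r'CONFORMED PERIOD OF REPORT:\s*(\d{8})', header)
print(match.group(1))  # -> 20151231
```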

  13. Tiago Ferreira says:

    Hello,
    Thank you so much for the post.

    I also need to download the 20-F filings.
    Any ideas on how to do that?
    Any ideas of how to do that?

    Thanks

  14. Mostak Ahamed says:

    Hi Kai,
    Thanks for the Stata dataset edgar_idx.dta.
    I am not proficient in Python. Could I fetch the text files from the SEC using Stata, with the paths in the index?

    thanks

  15. Bei says:

    Hi Kai,

    Thank you so much for sharing the code! While fetching the files from the SEC, I get the error “HTTPError: HTTP Error 429: Too Many Requests”. Then I tried to browse the SEC website, and it gave me the following message.

    “You’ve Exceeded the SEC’s Traffic Limit

    Your request rate has exceeded the SEC’s maximum allowable requests per second. Your access to SEC.gov will be limited for 10 minutes.

    Current guidelines limit each user to a total of no more than 10 requests per second, regardless of the number of machines used to submit requests. To ensure that SEC.gov remains available to all users, we reserve the right to block IP addresses that submit excessive requests.”

    Could I know whether you encounter this problem before? Do you have any suggestions to solve it? Thanks!

    • Kai Chen says:

      In this case, you have to pace your program, i.e., add pauses to slow it down so that it does not run over the limit. You can import the time module and use, e.g., time.sleep(25), which instructs the program to pause for 25 seconds.
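      A sketch of the pacing idea (the 10-requests-per-second figure comes from the SEC notice quoted above; the helper name and delay value here are just examples):

```python
import time

def paced(items, delay=0.2):
    """Yield items one at a time, sleeping between them so the request
    rate stays well under the SEC's stated 10-requests-per-second limit."""
    for item in items:
        yield item
        time.sleep(delay)

# Usage: wrap the list of paths/URLs; download(url) is a hypothetical
# placeholder for the actual request.
for url in paced(['edgar/a.txt', 'edgar/b.txt']):
    pass  # download(url)
```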

  16. Rob says:

    Kai,

    I cannot get the newline argument to work properly, and thus I incur an IndexError because my indexes are out of range. Could there be something wrong with my CSV file?

    • Kai Chen says:

      I don’t know exactly what went wrong on your side. I’m using a Mac. If you’re on a Windows machine, note that Windows and Mac use different newline characters; maybe that’s the reason. Please check the Python documentation.
