python - "'int' object is unsubscriptable" error when slicing a string


I am writing a web crawler, and I have a table full of links to PDF files that I want to download, save, and analyze later. I am using Beautiful Soup, and I have it find all the links. They are usually Beautiful Soup tag objects, but I have turned them into strings. Each string is actually a bunch of junk with the link text buried in the middle. I want to cut away that junk and leave just the link. Then I will turn these into a list and have Python download them later. (My plan is to keep a list of PDF link names to keep track of what has been downloaded, and then name the files according to those links or a portion of them.)

But the .pdf names come in variable lengths, such as:

  • I_am_the_first_file.pdf
  • and_I_am_the_seond_file.pdf

And because they come out of the table, they have a bunch of junk text around them:

  • a href=://blah/blah/blah/I_am_the_first_file.pdf[plus other annotation content that ends up in my string by accident]
  • a href=://blah/blah/blah/and_I_am_the_seond_file.pdf[plus other annotation content that ends up in my string by accident]

So I want to cut ("slice") off the front part and the last part of the string and leave just the part that points to my URL (so the desired output of my program is as follows):

  • ://blah/blah/blah/I_am_the_first_file.pdf
  • ://blah/blah/blah/and_I_am_the_seond_file.pdf

As you can see, however, the second file's string has more characters than the first. So I cannot just do:

    string[9:40]

or whatever happens to work for the first file, because it would not work for the second.

So I'm trying to come up with a variable for the end of the string slice, such as:

    string[9:x]

where x is the position in the string where '.pdf' ends (my thought was to use the string.index('.pdf') function to find it), but that gives me an error:

    ("TypeError: 'int' object is unsubscriptable")

There is probably an easier way to do this, and a better way than messing around with strings, but you people are way smarter than me and I figured you would know straight away.

Here's my complete code:

    import urllib2
    from BeautifulSoup import BeautifulSoup

    page = urllib2.urlopen("mywebsite.com")
    soup = BeautifulSoup(page)

    table_with_my_pdf_links = soup.find('table', id='searchResults')
    # "searchResults" is just what the table I was looking for happened to be called

    for pdf_link in table_with_my_pdf_links.findAll('a'):
    # this says to find all the links and loop over them

        pdf_link_string = str(pdf_link)
        # turn the links into strings (they are usually soup tag objects, which don't help me much that I know of)

        if 'pdf' in pdf_link_string:
        # some links in the table are .html and I don't want those, I just want the pdfs

            end_of_link = pdf_link_string.index('.pdf')
            # I want to know where the .pdf file extension is, because that is the end of the link

            just_the_link = end_of_link[9:end_of_link]
            # here, the first 9 characters are junk "a href = yadda yadda yadda", so I'm setting a variable that starts
            # just after that junk and goes up to the .pdf (I know I actually have to do .pdf + 3 or something to
            # really get to the end of the string, but this makes it easier for now)

            print just_the_link
            # I debug by print statement because I'm an amateur

The line (second from the bottom): just_the_link = end_of_link[9:end_of_link]

returns an error (TypeError: 'int' object is unsubscriptable).

Also, the ":" should be preceded by "http" (Hypertext Transfer Protocol colon), but new users cannot post more than 2 links, so I took them out.

    just_the_link = end_of_link[9:end_of_link]

This is your problem, just as the error message says. end_of_link is an integer: the index of ".pdf" in pdf_link_string, which you calculated on the previous line. So naturally you cannot slice it. You want to slice pdf_link_string instead.
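As a minimal sketch of that fix: the helper name extract_link, its junk_prefix_len parameter, and the two sample tag strings below are made up for illustration (the real strings come from str(pdf_link) in your loop). It slices the string rather than the integer, and adds the length of ".pdf" so the extension is kept:

```python
def extract_link(pdf_link_string, junk_prefix_len=9):
    """Return the text from just after the leading junk through '.pdf'.

    junk_prefix_len=9 assumes the string starts with '<a href="',
    which is exactly 9 characters long.
    """
    extension = '.pdf'
    # str.index returns an int: the position where '.pdf' starts.
    end_of_link = pdf_link_string.index(extension)
    # Slice the string itself (not the int), and include the extension.
    return pdf_link_string[junk_prefix_len:end_of_link + len(extension)]


# Hypothetical tag strings, standing in for str(pdf_link).
first = '<a href="://blah/blah/blah/I_am_the_first_file.pdf">first</a>'
second = '<a href="://blah/blah/blah/and_I_am_the_seond_file.pdf">second</a>'

for tag_string in (first, second):
    print(extract_link(tag_string))
```

Because the end of the slice is computed per string, this works no matter how long each file name is. Note that str.index raises ValueError when ".pdf" is not found, so keep the 'pdf' in pdf_link_string check (or catch the exception) before calling it.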
