
top_500_enterprises's Introduction

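Approach

  1. First scrape each company's name and revenue (in millions of yuan). On the ranking page these sit in an element with a distinctive style attribute, laid out in regular tr rows.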

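  2. Inspecting any entry shows that the company name sits inside an <a href> tag, and the details we scrape later are reached through that same hyperlink, so collect all the link tags in a list first:

    links = soup.find(attrs={"style": 'word-break:break-all'}).find_all('a')

Then .string gives the company name directly:

    for item in links:
        item_name = item.string

  3. Then find the revenue by moving to the link's parent node and on to that node's next sibling:

        item_revenue = item.find_parent().find_next_sibling().string

  • At first I did not scrape this from the page that lists the company names; I followed each hyperlink and scraped the new page instead. The tags on that page are irregular, though, and the object relationships became too complicated to handle. It is much easier to start from the page with the regular structure.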
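The link lookup and the parent/sibling navigation above can be tried on a small made-up table before pointing the scraper at a real page. This is only a sketch: the markup, the company names, and the helper name extract_rows are hypothetical, and it uses the built-in html.parser so that nothing beyond BeautifulSoup is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the ranking table described above.
SAMPLE_HTML = (
    '<table style="word-break:break-all">'
    '<tr><td><a href="../../company/a.htm">Company A</a></td><td>2,000.5</td></tr>'
    '<tr><td><a href="../../company/b.htm">Company B</a></td><td>1,500</td></tr>'
    '</table>'
)


def extract_rows(html):
    """Return (name, revenue, href) tuples, one per link in the table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find(attrs={"style": "word-break:break-all"})
    rows = []
    for link in table.find_all("a"):
        name = link.string
        # <a> -> parent <td> -> next sibling <td>, which holds the revenue
        revenue = link.find_parent().find_next_sibling().string
        rows.append((name, revenue, link.get("href")))
    return rows


rows = extract_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

Keeping the href in the same tuple means the detail page for each company can be fetched later without searching the table again.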
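  4. Get the hyperlink with the get method:
  • The relative path has to be resolved here.
  • Because this opens a new page, its HTML has to be parsed into a soup object again.

While writing this I found that the response could not be processed the way the first page was: response.content is just the raw page data rather than a parsed document, so none of the BeautifulSoup methods can be called on it.

The underlying issue is that every newly opened page has to be parsed again: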

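        sub_url = item.get('href')
        # lstrip removes any leading '.' and '/' characters, not the literal prefix '../../'.
        sub = sub_url.lstrip('../../')
        # "主站网址" is a placeholder for the main site URL.
        new_url = "主站网址" + sub
        new_html = request_url(new_url)
        text = BeautifulSoup(new_html, 'lxml')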
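As an alternative to stripping the prefix by hand, the standard library's urllib.parse.urljoin can resolve a relative href against the URL of the page it was found on. The sketch below uses hypothetical example.com URLs; whether the result matches "main site URL + stripped path" on the real site depends on how that site lays out its directories, which is an assumption here.

```python
from urllib.parse import urljoin

# Hypothetical URLs, for illustration only.
page_url = "https://example.com/rank/2019/index.htm"
href = "../../company/a.htm"

# urljoin resolves the '../' segments against the page the link came from.
detail_url = urljoin(page_url, href)
print(detail_url)

# str.lstrip treats its argument as a set of characters, so it also eats
# leading dots that belong to the path itself.
stripped = "../../.well-known/info.htm".lstrip("../../")
print(stripped)
```

The second half shows why lstrip is fragile for this job: it works for paths like the ones above only because the first real path segment does not start with a dot or a slash.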

  5. From here the structure is quite regular: class, then tr, then align:right. Call find_all and treat the result as a list:

        item_sub = text.find(class_='ui-table1 box-s1').find_all('tr')
        item_info = []
        for tr in item_sub:
            item_info.append(tr.find(attrs={"align": 'right'}).string)
        item_industry = item_info[2]
        item_location = item_info[3]
        item_number = item_info[4]
        item_website = item_info[5]
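The row-by-row extraction can be checked on a made-up table as well. In this sketch the markup, the labels, and the helper name right_aligned_cells are all hypothetical. Unlike the snippet above, it records None for a row that has no right-aligned cell instead of raising, which is a defensive choice of mine rather than something the real page is known to need.

```python
from bs4 import BeautifulSoup

# Hypothetical detail-page markup; labels and values are made up.
SAMPLE_HTML = (
    '<table class="ui-table1 box-s1">'
    '<tr><th colspan="2">Company profile</th></tr>'
    '<tr><td>Industry</td><td align="right">Energy</td></tr>'
    '<tr><td>Website</td><td align="right"><a href="#">www.example.com</a></td></tr>'
    '</table>'
)


def right_aligned_cells(html):
    """Return the text of the right-aligned cell in each row (None if absent)."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find(class_="ui-table1 box-s1")
    values = []
    for tr in table.find_all("tr"):
        cell = tr.find(attrs={"align": "right"})
        # get_text also works when the cell wraps its text in another tag
        values.append(cell.get_text(strip=True) if cell else None)
    return values


values = right_aligned_cells(SAMPLE_HTML)
print(values)
```

Because the fields are then picked by position, it is worth printing the list once for a real page to confirm which index holds which field.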
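  6. Problems encountered and how I solved them:

(1) Handling non-standard tags: call find_all first to get a list, read the text with ".text", then pick the number out with a regular expression.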

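    def number_process(s):
        # Return the first number in s, keeping thousands separators, the
        # decimal part and a leading minus sign; '' if there is none.
        match = re.search(r'-?\d[\d,]*(?:\.\d+)?', s)
        return match.group() if match else ''

        item_main = text.find(class_='ui-homerank box-s1').find_all('p')
        item_profit = number_process(item_main[1].text)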
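A quick way to gain confidence in the number extraction is to run it on a few made-up strings, including a value whose first digit group repeats later and a negative value. The labels below are hypothetical, and first_number is just an illustrative name for a helper that searches for the whole number in one match.

```python
import re

# One match covers an optional minus sign, digits with thousands
# separators, and an optional decimal part.
NUMBER_RE = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def first_number(text):
    """Return the first number in text as a string, or None if there is none."""
    match = NUMBER_RE.search(text)
    return match.group() if match else None


samples = [
    "Profit (million yuan): 1,214.1",  # the digit '1' appears several times
    "Profit (million yuan): -305.7",   # a loss
    "Profit (million yuan): n/a",      # no number at all
]
results = [first_number(s) for s in samples]
print(results)
```

Returning None for text without a number lets the caller decide whether to skip the row or write an empty cell.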

(2) Numbers exported to Excel end up as text and contain commas. Remove the commas:

        item_revenue = item_revenue.replace(',', '')

then convert the result with float() or int().
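The two steps can be folded into one small helper. This is a sketch with a hypothetical name, to_number, and made-up values; it returns an int when the text has no decimal point and a float otherwise, so whole numbers such as employee counts stay whole.

```python
def to_number(text):
    """Convert a string such as '1,234.5' or '1,234' to a float or an int."""
    cleaned = text.replace(",", "").strip()
    return float(cleaned) if "." in cleaned else int(cleaned)


revenue = to_number("414,649.9")
employees = to_number(" 1,500 ")
print(revenue, employees)
```

Passing the converted value, rather than the original string, to sheet.write is what makes the cell numeric in the spreadsheet.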

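(3) A file saved with the .xlsx extension will not open, because xlwt only writes the legacy .xls format. Save it as .xls instead.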

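(4) The scraped data contains Chinese text:

    def request_url(url):
        try:
            response = requests.get(url)
            if response.status_code == 200:
                return response.content
        except requests.RequestException:
            return None

Returning response.content instead of response.text is enough: BeautifulSoup then receives the raw bytes and detects the page encoding itself.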
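Why the text comes out garbled can be reproduced without any network access. As far as I know, response.text is decoded with the encoding requests infers from the HTTP headers, and for text responses whose header names no charset it falls back to ISO-8859-1. The sketch below imitates that wrong guess on a made-up UTF-8 string.

```python
# Bytes as a server would send them for a UTF-8 page (sample text is made up).
raw = "公司名称".encode("utf-8")

# Decoding with the wrong encoding never fails, it just produces mojibake.
wrong = raw.decode("iso-8859-1")

# Decoding with the page's real encoding gives the original text back.
right = raw.decode("utf-8")

print(repr(wrong))
print(right)
```

Handing the undecoded bytes to BeautifulSoup avoids the wrong guess altogether, because the parser inspects the document itself to work out the encoding.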

The complete code follows:

    import re

    import requests
    import xlwt
    from bs4 import BeautifulSoup


    def request_url(url):
        # Return the raw response body as bytes, or None if the request
        # fails or the status code is not 200.
        try:
            response = requests.get(url)
            if response.status_code == 200:
                return response.content
        except requests.RequestException:
            return None


    def number_process(s):
        # Return the first number in s, keeping thousands separators, the
        # decimal part and a leading minus sign; '' if there is none.
        match = re.search(r'-?\d[\d,]*(?:\.\d+)?', s)
        return match.group() if match else ''


    book1 = xlwt.Workbook(encoding='utf-8', style_compression=0)

    sheet = book1.add_sheet('**企业500强', cell_overwrite_ok=True)
    # Header row: rank, name, revenue (million yuan), profit (million yuan),
    # industry, company address, number of employees, website.
    sheet.write(0, 0, '排名')
    sheet.write(0, 1, '名称')
    sheet.write(0, 2, '营收(百万元)')
    sheet.write(0, 3, '利润(百万元)')
    sheet.write(0, 4, '行业')
    sheet.write(0, 5, '公司地址')
    sheet.write(0, 6, '员工数')
    sheet.write(0, 7, '网站')

    n = 1  # next sheet row to fill, which is also the company's rank


    def save_to_excel(soup):
        global n
        links = soup.find(attrs={"style": 'word-break:break-all'}).find_all('a')
        for item in links:
            item_name = item.string
            item_revenue = item.find_parent().find_next_sibling().string
            item_revenue = item_revenue.replace(',', '')

            # Open the company's detail page and parse it again.
            # "主站网址" is a placeholder for the main site URL.
            sub_url = item.get('href')
            sub = sub_url.lstrip('../../')
            new_url = "主站网址" + sub
            new_html = request_url(new_url)
            text = BeautifulSoup(new_html, 'lxml')

            item_main = text.find(class_='ui-homerank box-s1').find_all('p')
            item_profit = number_process(item_main[1].text)
            item_profit = item_profit.replace(',', '')
            item_sub = text.find(class_='ui-table1 box-s1').find_all('tr')
            item_info = []
            for tr in item_sub:
                item_info.append(tr.find(attrs={"align": 'right'}).string)
            item_industry = item_info[2]
            item_location = item_info[3]
            item_number = item_info[4]
            item_website = item_info[5]

            print(' | '.join([str(n), item_name, item_revenue, item_profit,
                              item_industry, item_location, item_number,
                              item_website]))

            sheet.write(n, 0, str(n))
            sheet.write(n, 1, item_name)
            sheet.write(n, 2, float(item_revenue))
            sheet.write(n, 3, float(item_profit))
            sheet.write(n, 4, item_industry)
            sheet.write(n, 5, item_location)
            sheet.write(n, 6, item_number)
            sheet.write(n, 7, item_website)

            n = n + 1


    url = '目标网址'  # placeholder for the URL of the ranking page
    html = request_url(url)
    soup = BeautifulSoup(html, 'lxml')
    save_to_excel(soup)

    book1.save(u'文件名.xls')  # placeholder file name; keep the .xls extension

The overall structure is adapted from:

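python爬虫08 | 你的第二个爬虫,要过年了,爬取豆瓣最受欢迎的250部电影慢慢看 ("Python Crawler 08 | Your second crawler: the New Year is coming, so scrape Douban's 250 most popular movies to watch at your leisure")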

top_500_enterprises's People

Contributors

jackson-nzp
