文字实例——饿了么定点商圈抓取

0 讨 论 387 浏览量 云采集引擎

饿了么定点商圈抓取实例
1.抓取网站

饿了么官网:https://www.ele.me/home/

2.入口配置

打开之后需要先确定抓取的地区,输入地区点击搜索按钮获取搜索的链接

这里以成都国际金融中心(IFS)为例:https://www.ele.me/place/wm6n2kt007k

直接抓取上面这个链接发现没有商铺的数据,商铺数据通过json加载
然后用浏览器开发者工具找到json链接

https://mainsite-restapi.ele.me/shopping/restaurants?extras%5B%5D=activities&geohash=wm6n2kt007k&latitude=30.65462&limit=24&longitude=104.08037&offset=0&terminal=web
https://mainsite-restapi.ele.me/shopping/restaurants?extras%5B%5D=activities&geohash=wm6n2kt007k&latitude=30.65462&limit=24&longitude=104.08037&offset=24&terminal=web
https://mainsite-restapi.ele.me/shopping/restaurants?extras%5B%5D=activities&geohash=wm6n2kt007k&latitude=30.65462&limit=24&longitude=104.08037&offset=48&terminal=web

可以看出offset=这一部分表示翻页,每页24条商铺
所以配置入口使用自动增长,设定0-960条(饿了么有翻页限制,大约660左右最多,这里为了防止特例,设定多一点),每次增长24

3.链接配置

通过上面json加载到的链接可以获得商铺的大多数基本信息(不包含商品和单品价格等)
所以这里就不要在进行详细页面的下载了,在这里可以将一个商铺的所有信息作为参数传递到下一步,这样数据字段就不需要在下载饿了么页面了
由于json返回的固定模式,我们直接通过开始的字符串将每个json内容取出来

配置url识别规则:"address":"[url]},{"activities"

单条商铺信息如下:

成都市锦江区东大街上东大街段133号","average_cost":"¥41/人","description":"","distance":685,"float_delivery_fee":9,"float_minimum_order_amount":0···"name":"成都麦当劳壹购广场餐厅","next_business_time":"10:30","only_use_poi":false,"opening_hours":["00:00/04:45","05:00/10:15","10:30/23:55"],"order_lead_time":38,"phone":"4000517117","piecewise_agent_fee":{"description":"配送费¥9","extra_fee":0,"is_extra":false,"rules":[{"fee":9,"price":0}],"tips":"配送费¥9"},"promotion_info":"送福利!4月每周五、周六、周日,下单即送价值7元香芋派一份。","rating":4.7,"rating_count":333,"recent_order_num":1781,"recommend":{"image_hash":"ff085f835038a87ae040a8cd53338f4cjpeg","track":"{\"rankType\":\"3\"}"},"regular_customer_count":0,"status":1,"supports":[{"description":"该商家支持开发票,请在下单时填写好发票抬头","icon_color":"999999","icon_name":"票","id":4,"name":"开发票"}],"type":0

因为是单页抓取多条信息,所以需要配置前缀,#之后就是上面的内容,回作为参数传递

补充链接前缀:[entryurl]#

链接生成如下:

https://mainsite-restapi.ele.me/shopping/restaurants?···:{"color":"57A9FF","id":1,"is_solid":true,"text":"蜂鸟专送"},"description":" 食其家(中环广场店)","distance":837,"float_delivery_fee":5,"float_minimum_order_amount":20,"id":530868,···"name":"食其家(中环广场店)","next_business_time":"","only_use_poi":false,"opening_hours":["10:00/21:30"],"order_lead_time":24,"phone":"028-62100322","piecewise_agent_fee":{"description":"配送费¥4","extra_fee":0,"is_extra":false,"rules":[{"fee":4,"price":20}],"tips":"配送费¥4"},"promotion_info":"春节期间饿了么运力不足,订餐请移步美团,百度外卖,谢谢您对本餐厅的厚爱!","rating":4.8,"rating_count":181,"recent_order_num":762,"recommend":{"image_hash":"ff085f835038a87ae040a8cd53338f4cjpeg","track":"{\"rankType\":\"3\"}"},"regular_customer_count":0,"status":1,"supports":[{"description":"已加入“外卖保”计划,食品安全有保障","icon_color":"999999","icon_name":"保","id":7,"name":"外卖保"},{"description":"该商家支持开发票,请在下单时填写好发票抬头","icon_color":"999999","icon_name":"票","id":4,"name":"开发票"},{"description":"准时必达,超时秒赔","icon_color":"57A9FF","icon_name":"准","id":9,"name":"准时达"}],"type":0

这里有个小技巧,如果单页面访问多次会访问可能会导致封IP,而我们这里又将所有需要的信息作为参数了,所以不需要在下载饿了么页面,这里可以用百度页面替换,防止被封IP

补充链接前缀:https://www.baidu.com/#

连接配置完成

4.数据字段

数据字段配置直接使用传递进来的参数,规则中调用[poundstr]通配符即可

识别规则:[poundstr]

然后通过自定义脚本正则匹配各个信息

自定义脚本样例:
<?php
function CUSTOMFUN($inputstr){
// TODO: Coding Here
//下面室正则匹配,这里获商铺名字
preg_match("#\"name\":\"([^']*?)\",#is",$inputstr,$a);
return $a[1];
}
echo CUSTOMFUN($PARASTR); // Not Allowed to Modify
?>

Finndy Copyright©2017 | Powered by Q2A

...